The Y-Net model consists of a U-Net conditioned on visual features.
The system works on chunks of 4n seconds, where n ∈ ℕ. The audio network takes as input a 256 × 16Tn complex spectrogram and returns a complex mask. For Y-Net-m and Y-Net-mr, the visual network is the video network (in red), which takes as input a set of 100n frames cropped around the mouth of the target singer. For Y-Net-g and Y-Net-gr, the visual network is the graph network (in green), which takes as input a sequence of 68n facial landmarks of the target singer. The visual features are fused with the audio network's latent space through a FiLM layer (we use T = 16); the FiLM layer broadcasts the 256 × 1 × T visual features to match the 256 × 16 × T audio ones. The spatial blocks of the U-Net downsample along both the frequency and temporal dimensions, while the frequential blocks downsample along the frequency dimension only.
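The FiLM fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection matrices `w_gamma` and `w_beta` are hypothetical stand-ins for whatever layers produce the FiLM parameters, and only the channel-wise affine modulation and the broadcast over the frequency axis (1 → 16) are shown.

```python
import numpy as np

def film_fuse(audio, visual, w_gamma, w_beta):
    """FiLM fusion sketch: visual features (C, 1, T) modulate an audio
    latent (C, F, T) via a per-channel affine transform.

    gamma and beta are linear projections of the visual features over the
    channel axis; NumPy broadcasting expands the size-1 frequency axis to F,
    matching the 256x1xT -> 256x16xT expansion in the caption.
    """
    gamma = np.einsum('dc,cft->dft', w_gamma, visual)  # (C, 1, T)
    beta = np.einsum('dc,cft->dft', w_beta, visual)    # (C, 1, T)
    return gamma * audio + beta                        # broadcast over F

C, F, T = 256, 16, 16
audio = np.random.randn(C, F, T)    # audio latent
visual = np.random.randn(C, 1, T)   # visual features
w = np.random.randn(C, C) / np.sqrt(C)
fused = film_fuse(audio, visual, w, w)
print(fused.shape)  # (256, 16, 16)
```

The key point is that the visual stream carries no frequency axis, so its modulation is applied identically to every frequency bin of the audio latent at each time step.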
Detailed layers are shown below.
U-Net structure
| Block # | Type of block | Output channels | Kernel | Padding | Output shape |
|---|---|---|---|---|---|
| 1 | Spatial | 32 | 5 × 5 | 2 | 128 × 128 |
| 2 | Spatial | 64 | 5 × 5 | 2 | 64 × 64 |
| 3 | Spatial | 128 | 5 × 5 | 2 | 32 × 32 |
| 4 | Spatial | 256 | 5 × 5 | 2 | 16 × 16 |
| 5 | Frequential | 256 | 5 × 5 | 2 | 8 × 16 |
| 6 | Frequential | 256 | 5 × 5 | 2 | 4 × 16 |
| - | Bottleneck | 256 | 3 × 3 | 1 | 8 × 16 |
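The two block types differ only in which axes they downsample, which is what produces the output shapes in the table above. A minimal sketch, using stride-2 subsampling as a stand-in for the actual strided convolutions (channel counts and kernels are omitted):

```python
import numpy as np

def spatial_down(x):
    # spatial block: downsample by 2 along BOTH frequency and time
    return x[:, ::2, ::2]

def frequential_down(x):
    # frequential block: downsample by 2 along frequency ONLY
    return x[:, ::2, :]

# (channels, freq, time); a 256 x 256 spectrogram, i.e. 256 x 16T with T = 16
x = np.zeros((32, 256, 256))
for _ in range(4):
    x = spatial_down(x)       # 256 x 256 -> 16 x 16 after blocks 1-4
for _ in range(2):
    x = frequential_down(x)   # 16 x 16 -> 4 x 16 after blocks 5-6
print(x.shape[1:])  # (4, 16)
```

Downsampling only the frequency axis in the last two blocks preserves the temporal resolution T = 16 needed for the frame-wise FiLM conditioning.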
Video Network structure
| Block # | Type of block | Output channels | Kernel | Padding | Stride | Output shape |
|---|---|---|---|---|---|---|
| 0 | Basic Stem | 64 | 3 × 7 × 7 | 1 × 2 × 2 | 1 × 3 × 3 | 100 × 100 |
| 1 | Spatio-temporal | 64 | 3 × 3 × 3 | 1 × 1 × 1 | 1 × 1 × 1 | 100 × 100 |
| 2 | Spatial | 128 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
| 3 | Spatial | 256 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
Graph Network structure
| Block # | Output channels | Output shape |
|---|---|---|
| 1 | 32 | 100 × 68 |
| 2 | 32 | 100 × 68 |
| 3 | 64 | 50 × 68 |
| 4 | 64 | 50 × 68 |
| 5 | 128 | 25 × 68 |
| 6 | 128 | 25 × 68 |
| 7 | 256 | 13 × 68 |
| 8 | 256 | 13 × 68 |
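The output shapes in the graph network table imply stride-2 downsampling along the temporal axis every second block (100 → 50 → 25 → 13 frames, with ceiling division at the last step), while the 68 landmark nodes are preserved throughout. The exact graph convolution used is not specified here; this sketch only illustrates the temporal striding implied by the shapes, with channel growth omitted:

```python
import numpy as np

def temporal_down(x):
    # stride-2 subsampling along the time axis; the 68 landmark
    # nodes (last axis) are never downsampled
    return x[:, ::2, :]

x = np.zeros((32, 100, 68))  # (channels, frames, landmarks), n = 1
for _ in range(3):
    x = temporal_down(x)     # 100 -> 50 -> 25 -> 13
print(x.shape)  # (32, 13, 68)
```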