The Y-Net model consists of a U-Net conditioned on visual features.

Y-Net

The system works with chunks of 4n seconds, where n ∈ ℕ. The audio network takes as input a 256 × 16Tn complex spectrogram and returns a complex mask. In the case of Y-Net-m and Y-Net-mr, the visual network is the video network (in red), which takes as input a set of 100n frames cropped around the mouth of the target singer. In the case of Y-Net-g and Y-Net-gr, the visual network is the graph network (in green), which takes as input a sequence of 68n landmarks of the face of the target singer. The visual features are fused with the audio network's latent space through a FiLM layer (we use T = 16). The FiLM layer broadcasts the 256×1×T visual features into the 256×16×T audio ones. The spatial blocks of the U-Net downsample along both the frequency and the temporal dimensions, while the frequential blocks downsample along the frequency dimension only.
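The FiLM fusion step above can be sketched with plain NumPy broadcasting. The shapes come from the text (256×1×T visual features modulating 256×16×T audio features, with T = 16); deriving the scale and shift directly from the visual features is an illustrative assumption, not the paper's exact parameterization.

```python
import numpy as np

# Shapes from the text: audio latent features are 256 x 16 x T,
# visual features are 256 x 1 x T (we use T = 16).
C, F, T = 256, 16, 16
audio = np.random.randn(C, F, T)   # U-Net latent space
visual = np.random.randn(C, 1, T)  # conditioning features, one vector per time step

# FiLM applies a feature-wise affine modulation, gamma * x + beta.
# Computing gamma and beta directly from the visual features is a
# hypothetical simplification; in practice they come from learned layers.
gamma = 1.0 + visual               # broadcast over the 16 frequency bins
beta = visual
modulated = gamma * audio + beta   # shape (256, 16, 16)
```

The singleton frequency axis of the visual features is what lets NumPy-style broadcasting replicate them across all 16 frequency bins of the audio latent space.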

Detailed layers are shown below.

U-Net structure

| Block # | Type of block | Output channels | Kernel | Padding | Output shape |
|---|---|---|---|---|---|
| 1 | Spatial | 32 | 5 × 5 | 2 | 128 × 128 |
| 2 | Spatial | 64 | 5 × 5 | 2 | 64 × 64 |
| 3 | Spatial | 128 | 5 × 5 | 2 | 32 × 32 |
| 4 | Spatial | 256 | 5 × 5 | 2 | 16 × 16 |
| 5 | Frequential | 256 | 5 × 5 | 2 | 8 × 16 |
| 6 | Frequential | 256 | 5 × 5 | 2 | 4 × 16 |
| – | Bottleneck | 256 | 3 × 3 | 1 | 8 × 16 |
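The encoder shapes in the table follow from the downsampling rule in the caption: spatial blocks halve both axes, frequential blocks halve only the frequency axis. A small bookkeeping sketch, assuming n = 1 and T = 16 so the input spectrogram is 256 (freq) × 256 (time):

```python
def encoder_shapes(freq=256, time=256, spatial_blocks=4, freq_blocks=2):
    """Track (freq, time) through the U-Net encoder blocks."""
    shapes = []
    for _ in range(spatial_blocks):   # spatial blocks halve freq and time
        freq, time = freq // 2, time // 2
        shapes.append((freq, time))
    for _ in range(freq_blocks):      # frequential blocks halve freq only
        freq //= 2
        shapes.append((freq, time))
    return shapes

print(encoder_shapes())
# [(128, 128), (64, 64), (32, 32), (16, 16), (8, 16), (4, 16)]
```

The output reproduces the "Output shape" column for blocks 1 through 6.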


Video Network structure

| Block # | Type of block | Output channels | Kernel | Padding | Stride | Output shape |
|---|---|---|---|---|---|---|
| 0 | Basic Stem | 64 | 3 × 7 × 7 | 1 × 3 × 3 | 1 × 2 × 2 | 100 × 100 |
| 1 | Spatio-temporal | 64 | 3 × 3 × 3 | 1 × 1 × 1 | 1 × 1 × 1 | 100 × 100 |
| 2 | Spatial | 128 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
| 3 | Spatial | 256 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
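The stem's effect on each axis can be checked with the standard convolution output-size formula. The temporal figures follow from the table (kernel 3, stride 1, padding 1 preserves the 100n frames); the 100-pixel mouth-crop size used for the spatial check is an assumption for illustration.

```python
def conv_out(size, kernel, stride, padding):
    # Standard convolution output size: floor((n + 2p - k) / s) + 1
    return (size + 2 * padding - kernel) // stride + 1

# Temporal axis through the stem (kernel 3, stride 1, padding 1):
# the 100n input frames are preserved.
print(conv_out(100, 3, 1, 1))  # 100

# Spatial axes (kernel 7, stride 2, padding 3) halve an assumed
# 100-pixel crop.
print(conv_out(100, 7, 2, 3))  # 50
```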


Graph Network structure

| Block # | Output channels | Output shape |
|---|---|---|
| 1 | 32 | 100 × 68 |
| 2 | 32 | 100 × 68 |
| 3 | 64 | 50 × 68 |
| 4 | 64 | 50 × 68 |
| 5 | 128 | 25 × 68 |
| 6 | 128 | 25 × 68 |
| 7 | 256 | 13 × 68 |
| 8 | 256 | 13 × 68 |
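The graph network's shape column follows a simple pattern: the 68-landmark axis is never reduced, while blocks 3, 5 and 7 halve the temporal axis with ceiling rounding (hence 25 → 13). A sketch that reproduces the column, with the set of downsampling blocks read off the table rather than stated in the text:

```python
import math

t, landmarks = 100, 68  # 100n frames (n = 1), 68 facial landmarks
shapes = []
for block in range(1, 9):
    if block in (3, 5, 7):        # assumed temporal-downsampling blocks
        t = math.ceil(t / 2)      # 100 -> 50 -> 25 -> 13
    shapes.append((t, landmarks))

print(shapes)
# [(100, 68), (100, 68), (50, 68), (50, 68), (25, 68), (25, 68), (13, 68), (13, 68)]
```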