The Y-Net model consists of a U-Net conditioned on visual features.
The system works on chunks of 4n seconds, where n ∈ ℕ. The audio network takes as input a 256 × 16Tn complex spectrogram and returns a complex mask. For Y-Net-m and Y-Net-mr, the visual network is the video network (in red), which takes as input a set of 100n frames cropped around the mouth of the target singer. For Y-Net-g and Y-Net-gr, the visual network is the graph network (in green), which takes as input a sequence of 68n facial landmarks of the target singer. The visual features are fused with the audio network's latent space through a FiLM layer (we use T = 16); the FiLM layer broadcasts the 256 × 1 × T visual features to match the 256 × 16 × T audio ones. The spatial blocks of the U-Net downsample along both the frequency and temporal dimensions, while the frequential blocks downsample along the frequency dimension only.
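The FiLM fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection matrices `w_gamma` and `w_beta` are hypothetical stand-ins for whatever layers produce the FiLM parameters, and only the channel-wise affine modulation and the broadcast over the frequency axis (1 → 16) are shown.

```python
import numpy as np

def film_fuse(audio, visual, w_gamma, w_beta):
    """FiLM fusion sketch: visual features (C, 1, T) modulate an audio
    latent (C, F, T) via a per-channel affine transform.

    gamma and beta are linear projections of the visual features over the
    channel axis; NumPy broadcasting expands the size-1 frequency axis to F,
    matching the 256x1xT -> 256x16xT expansion in the caption.
    """
    gamma = np.einsum('dc,cft->dft', w_gamma, visual)  # (C, 1, T)
    beta = np.einsum('dc,cft->dft', w_beta, visual)    # (C, 1, T)
    return gamma * audio + beta                        # broadcast over F

C, F, T = 256, 16, 16
audio = np.random.randn(C, F, T)    # audio latent
visual = np.random.randn(C, 1, T)   # visual features
w = np.random.randn(C, C) / np.sqrt(C)
fused = film_fuse(audio, visual, w, w)
print(fused.shape)  # (256, 16, 16)
```

The key point is that the visual stream carries no frequency axis, so its modulation is applied identically to every frequency bin of the audio latent at each time step.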
Detailed layers are shown below.
U-Net structure
| Block # | Type of block | Output channels | Kernel | Padding | Output shape |
|---|---|---|---|---|---|
| 1 | Spatial | 32 | 5 × 5 | 2 | 128 × 128 |
| 2 | Spatial | 64 | 5 × 5 | 2 | 64 × 64 |
| 3 | Spatial | 128 | 5 × 5 | 2 | 32 × 32 |
| 4 | Spatial | 256 | 5 × 5 | 2 | 16 × 16 |
| 5 | Frequential | 256 | 5 × 5 | 2 | 8 × 16 |
| 6 | Frequential | 256 | 5 × 5 | 2 | 4 × 16 |
| - | Bottleneck | 256 | 3 × 3 | 1 | 8 × 16 |
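The two block types differ only in which axes they downsample, which is what produces the output shapes in the table above. A minimal sketch, using stride-2 subsampling as a stand-in for the actual strided convolutions (channel counts and kernels are omitted):

```python
import numpy as np

def spatial_down(x):
    # spatial block: downsample by 2 along BOTH frequency and time
    return x[:, ::2, ::2]

def frequential_down(x):
    # frequential block: downsample by 2 along frequency ONLY
    return x[:, ::2, :]

# (channels, freq, time); a 256 x 256 spectrogram, i.e. 256 x 16T with T = 16
x = np.zeros((32, 256, 256))
for _ in range(4):
    x = spatial_down(x)       # 256 x 256 -> 16 x 16 after blocks 1-4
for _ in range(2):
    x = frequential_down(x)   # 16 x 16 -> 4 x 16 after blocks 5-6
print(x.shape[1:])  # (4, 16)
```

Downsampling only the frequency axis in the last two blocks preserves the temporal resolution T = 16 needed for the frame-wise FiLM conditioning.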
Video Network structure
| Block # | Type of block | Output channels | Kernel | Padding | Stride | Output shape |
|---|---|---|---|---|---|---|
| 0 | Basic Stem | 64 | 3 × 7 × 7 | 1 × 2 × 2 | 1 × 3 × 3 | 100 × 100 |
| 1 | Spatio-temporal | 64 | 3 × 3 × 3 | 1 × 1 × 1 | 1 × 1 × 1 | 100 × 100 |
| 2 | Spatial | 128 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
| 3 | Spatial | 256 | 3 × 3 | 1 × 1 | 2 × 2 | 100 × 100 |
Graph Network structure
| Block # | Output channels | Output shape |
|---|---|---|
| 1 | 32 | 100 × 68 |
| 2 | 32 | 100 × 68 |
| 3 | 64 | 50 × 68 |
| 4 | 64 | 50 × 68 |
| 5 | 128 | 25 × 68 |
| 6 | 128 | 25 × 68 |
| 7 | 256 | 13 × 68 |
| 8 | 256 | 13 × 68 |
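The output shapes in the graph network table imply stride-2 downsampling along the temporal axis every second block (100 → 50 → 25 → 13 frames, with ceiling division at the last step), while the 68 landmark nodes are preserved throughout. The exact graph convolution used is not specified here; this sketch only illustrates the temporal striding implied by the shapes, with channel growth omitted:

```python
import numpy as np

def temporal_down(x):
    # stride-2 subsampling along the time axis; the 68 landmark
    # nodes (last axis) are never downsampled
    return x[:, ::2, :]

x = np.zeros((32, 100, 68))  # (channels, frames, landmarks), n = 1
for _ in range(3):
    x = temporal_down(x)     # 100 -> 50 -> 25 -> 13
print(x.shape)  # (32, 13, 68)
```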