We strongly recommend using headphones to listen to the samples.


Real World Speech Examples

Example 1

Spectrogram of the Voice Mixture

Target Speaker
VoViT (Ours)
VisualVoice [1]


Example 2

Spectrogram of the Voice Mixture

Target Speaker
VoViT (Ours)
VisualVoice [1]


Singing Voice Examples

Aicha Song Cover


Target Voice: Lead Singer (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]


Beatbox Example


Target Voice: Beatbox
Mixture
VoViT (Ours)
Y-Net-graph [2]


Uptown Girl


Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]


Aretha Franklin - Respect #1


Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]


Aretha Franklin - Respect #2


Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]


Aretha Franklin - Respect #3


Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]


References

[1] Gao & Grauman, “VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency,” in CVPR, 2021.
[2] Montesinos et al., “A cappella: Audio-visual Singing Voice Separation,” in BMVC, 2021.