We strongly recommend using headphones to listen to the samples.
Real World Speech Examples
Example 1
Spectrogram of the Voice Mixture
Target Speaker
VoViT (Ours)
VisualVoice [1]
Example 2
Spectrogram of the Voice Mixture
Target Speaker
VoViT (Ours)
VisualVoice [1]
Singing Voice Examples
Aicha Song Cover
Target Voice: Lead Singer (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]
Beatbox Example
Target Voice: Beatbox
Mixture
VoViT (Ours)
Y-Net-graph [2]
Uptown Girl
Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]
Aretha Franklin - Respect #1
Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]
Aretha Franklin - Respect #2
Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]
Aretha Franklin - Respect #3
Target Voice: Lead voice (center)
Mixture
VoViT (Ours)
Y-Net-graph [2]
References
[1] Gao & Grauman, “VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency,” in CVPR, 2021.
[2] Montesinos et al., “A cappella: Audio-visual Singing Voice Separation,” in BMVC, 2021.