We strongly recommend using headphones to listen to the samples.
Real-World Speech Examples
Example 1
Spectrogram of the Voice Mixture

Target Speaker
VoViT (Ours)

VisualVoice [1]



Example 2
Spectrogram of the Voice Mixture

Target Speaker
VoViT (Ours)

VisualVoice [1]



Singing Voice Examples
Aicha Song Cover
Target Voice: Lead Singer (center)
Mixture

VoViT (Ours)

Y-Net-graph [2]

Beatbox Example
Target Voice: Beatbox
Mixture

VoViT (Ours)

Y-Net-graph [2]

Uptown Girl
Target Voice: Lead voice (center)
Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #1
Target Voice: Lead voice (center)
Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #2
Target Voice: Lead voice (center)
Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #3
Target Voice: Lead voice (center)
Mixture

VoViT (Ours)

Y-Net-graph [2]

References
[1] Gao & Grauman, “VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency,” in CVPR, 2021.
[2] Montesinos et al., “A cappella: Audio-visual Singing Voice Separation,” in BMVC, 2021.