We strongly recommend using headphones to listen to the samples.

Real-World Speech Examples

Example 1

Spectrogram of the Voice Mixture

Target Speaker

VoViT (Ours)

VisualVoice [1]

Example 2

Spectrogram of the Voice Mixture

Target Speaker

VoViT (Ours)

VisualVoice [1]

Singing Voice Examples

Aicha Song Cover

Target Voice: Lead Singer (center)

Mixture

VoViT (Ours)

Y-Net-graph [2]

Beatbox Example

Target Voice: Beatbox

Mixture

VoViT (Ours)

Y-Net-graph [2]

Uptown Girl

Target Voice: Lead voice (center)

Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #1

Target Voice: Lead voice (center)

Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #2

Target Voice: Lead voice (center)

Mixture

VoViT (Ours)

Y-Net-graph [2]

Aretha Franklin - Respect #3

Target Voice: Lead voice (center)

Mixture

VoViT (Ours)

Y-Net-graph [2]

References

[1] Gao & Grauman, “VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency,” in CVPR, 2021.
[2] Montesinos et al., “A cappella: Audio-visual Singing Voice Separation,” in BMVC, 2021.