Dear reader, each YouTube video starts at the segment of interest. The videos can only be controlled through the "Play" and "Stop" buttons that we have provided. We recommend first paying attention to the target singer's lip movements and then listening to the estimated samples.
Note: The perceivable differences between the singing voice estimates of the different models are sometimes subtle, so we strongly recommend listening with headphones. We also encourage consulting the spectrograms for visual cues that may help in perceiving these differences.


Examples for Singing Voice Separation Application

Here, we show some demos of how our models perform in estimating the desired target voice in real-world videos (taken from YouTube). Note that none of these singers appeared in the training set.


Love of My Life (Queen)


Target Voice: Lead Singer (top-right)
Mixture
Y-Net-mr [1]
Y-Net-mr-V (Ours)

Observation:
In the last three seconds, on the lyric "And Now You Leave", other voices leak into the Y-Net-mr estimate to the point that the lyrics become barely intelligible. Our model isolates the target voice better than Y-Net-mr and also improves the intelligibility of the lyrics.



Respect (Aretha Franklin)


Target Voice: Lead voice
Mixture
Y-Net-mr [1]
Y-Net-mr-V (Ours)

Observation:
Note that the Y-Net-mr estimate in the above example contains leakage of background vocals around the last second of audio, while our estimate is free of this leakage.



The Show Must Go On (Queen)


Target Voice: Lead Singer (center)
Mixture
Y-Net-mr [1]
Y-Net-mr-V (Ours)

Observation:
Note that, in the above example, the voice estimated by Y-Net-mr is more distorted than our model's estimate during the last second, on the word "scar". The word "scar" is more intelligible in our estimate.



Aicha (Cheb Khaled)


Target Voice: Lead Singer (center)
Mixture
Y-Net-mr [1]
Y-Net-mr-V (Ours)

Observation:
Note that the clarity of the voice in our estimate (especially at the beginning) is noticeably better than in the Y-Net-mr estimate above. Our model also better suppresses the interference from the backing vocals in the interval between 1 s and 2 s (best seen in the spectrograms over this interval).



References

[1] J. F. Montesinos, V. S. Kadandale, and G. Haro, “A cappella: Audio-visual Singing Voice Separation,” in BMVC, 2021.