Dear reader, the YouTube videos will start at the segment of interest. The videos can be controlled only through the "Play" and "Stop" buttons that we have provided. We recommend first watching the target singer's lip movements and then listening to the estimated samples. Note: the perceivable differences in the singing voice estimates across different models are sometimes subtle. We strongly recommend listening with headphones, and we draw your attention to the spectrograms for visual cues that may help perceive these differences.
Real-World Examples
Here, we show some demos of how our models perform at estimating the desired target voice in real-world videos taken from YouTube. Note that none of these singers were part of the training set.
Paranoid Android
Target Voice: Bottom-center Face
Uptown Girl
Target Voice: Lead voice (center)
Disney Medley
Target Voice: Center Face
Jackson5 Cover #1
Target Voice: Lead Singer (center)
Jackson5 Cover #2
Target Voice: Lead Singer (center)
Aicha
Target Voice: Lead Singer (center)
Bad Guy
Target Voice: Female Singer
Our Multi-voice Montage
The YouTube videos above do not provide the original isolated voices, so we could only evaluate the target voice separation models on them qualitatively. To enable a quantitative evaluation, we recorded a multi-voice montage with an amateur singer so that the individual voices can serve as ground truth. The montage has six voices and an accompaniment track. It is a very challenging example because the same singer is shown singing multiple voices at the same time. The song performed is "The Circle of Life", a popular soundtrack from the movie "The Lion King". Apart from English, the lyrics also contain Zulu, a language that was not part of the training set; the singer was not part of the training set either. Here, we estimate the lead voice, Rafiki (bottom-center face). See the right column in Table 2 of the paper for the quantitative metrics.
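To make the evaluation procedure concrete, below is a minimal sketch of how a separated voice can be scored against its ground truth. SI-SDR is used here purely as an illustrative metric; refer to the paper for the exact metrics reported in Table 2.

```python
# A minimal sketch (not the paper's evaluation code) of scoring a separated
# voice against its ground-truth recording with SI-SDR. Both signals are
# assumed to be mono and of equal length.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio, in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))

# Example usage with hypothetical arrays:
# print(si_sdr(ynet_estimate, rafiki_ground_truth))
```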
The Circle Of Life
Below are the estimates for the target voice Rafiki (sung by the bottom-center face). After 6 s (see the ground-truth spectrogram), this voice falls silent. The spectrograms clearly show that our Y-Net captures the silence in this region better than the other models; Y-Net-r appears to be the best-performing model here.
Target Voice: Rafiki (Bottom-center Face)
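For readers who want to inspect this silent region themselves, here is a minimal sketch of how a log-magnitude spectrogram like the ones above can be plotted. The file name, sample rate, and STFT parameters are illustrative assumptions, not the exact settings used for our figures.

```python
# A minimal sketch for plotting a log-magnitude spectrogram and marking the
# ~6 s point where the target voice falls silent. File name, sample rate,
# and STFT parameters are illustrative assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

audio, sr = librosa.load("ynet_rafiki_estimate.wav", sr=16000)  # hypothetical file
spec = librosa.stft(audio, n_fft=1024, hop_length=256)
log_mag = librosa.amplitude_to_db(np.abs(spec), ref=np.max)

img = librosa.display.specshow(log_mag, sr=sr, hop_length=256,
                               x_axis="time", y_axis="hz")
plt.axvline(x=6.0, color="white", linestyle="--")  # voice is silent after ~6 s
plt.colorbar(img, format="%+2.0f dB")
plt.title("Estimated voice (log-magnitude spectrogram)")
plt.tight_layout()
plt.show()
```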
Artificial Mixtures (Test Set)
The artificial mixtures are created by randomly mixing one singing voice with another, together with an accompaniment. The two singing voices come from our dataset, Acappella; the accompaniment is an audio segment randomly sampled from a subset of AudioSet. The mixing is random in the sense that we impose no constraints on musical appropriateness or aesthetics, so the resulting mixtures may not sound like a professional music video. We do this so that we can report metrics on singing voice separation quality, which is possible here, unlike in the real-world examples above, because we have the ground-truth sources (the individual singing voices). Below, we show some results on estimating the desired target voice in such artificial mixtures. For brevity, we show only the singing face of the target voice. Again, note that none of these singers were part of the training set.
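The sketch below illustrates this mixing procedure under our own simplifying assumptions: the file paths are hypothetical, the sources are assumed mono with a shared sample rate, and the random gains are illustrative rather than the dataset's exact recipe.

```python
# A minimal sketch of the random mixing described above: two singing voices
# plus an accompaniment summed with no musical constraints. File names and
# gain ranges are illustrative assumptions; all sources are assumed mono
# and recorded at the same sample rate.
import numpy as np
import soundfile as sf

rng = np.random.default_rng()

target, sr = sf.read("target_voice.wav")            # a voice from Acappella (hypothetical path)
other, _ = sf.read("other_voice.wav")               # a second Acappella voice
accomp, _ = sf.read("audioset_accompaniment.wav")   # segment sampled from AudioSet

# Trim all sources to a common length and mix with random gains.
n = min(len(target), len(other), len(accomp))
mixture = (target[:n]
           + rng.uniform(0.5, 1.0) * other[:n]
           + rng.uniform(0.5, 1.0) * accomp[:n])

# Peak-normalize to avoid clipping when writing to disk.
mixture /= max(1.0, np.abs(mixture).max())
sf.write("mixture.wav", mixture, sr)
```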
Unseen-Unheard Singer (English)
Unseen-Unheard Singer (New Language)
Failure Cases
Here, we show a typical failure case for our models. Our method relies on lip motion to separate the target singing voice. In many music videos, a group of singers sings perfectly in unison, or harmonizes while maintaining the same lip movements across all singers. In such settings, our method fails. We leave addressing target voice separation in these settings to future work.
Male-Female Unison
Target Voice: Male Singer (left)
In this example, we try to separate the male voice. None of our models estimates it correctly in this case. As the video shows, the male and female voices are perfectly synced in terms of lip movements.