A cappella: Audio-visual Singing Voice Separation

The paper has been accepted to BMVC 2021!

Abstract

Isolating a target singing voice in music videos has useful applications. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, jointly learning from the audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We also propose an audio-visual convolutional network based on graphs which achieves state-of-the-art singing voice separation results on our dataset, and we compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. We evaluate the models in three challenging setups: i) the presence of overlapping voices in the audio mixtures, ii) the target voice set to lower volume levels in the mix, and iii) the combination of i) and ii), the last being the most challenging. We demonstrate that our model outperforms the baseline models in the singing voice separation task in this most challenging evaluation setup. The code, the pre-trained models, and the dataset are publicly available at https://ipcv.github.io/Acappella/.


* These authors contributed equally to this work.


Citation

@inproceedings{montesinos2021cappella,
  title={A cappella: Audio-visual Singing Voice Separation},
  author={Montesinos, Juan F and Kadandale, Venkatesh S and Haro, Gloria},
  booktitle={32nd British Machine Vision Conference, BMVC 2021},
  year={2021}
}


Acknowledgements

The authors acknowledge support by MICINN/FEDER UE project, ref. PGC2018-098625-B-I00; H2020-MSCA-RISE-2017 project, ref. 777826 NoMADS; ReAViPeRo network, ref. RED2018-102511-T; and the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the European Social Fund. J. F. M. acknowledges support by FPI scholarship PRE2018-083920. V. S. K. has received financial support through "la Caixa" Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064. V. S. K. has also received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. We gratefully acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments. We thank Emilia Gómez and Olga Slizovskaia for insightful discussions on the subject.