VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Code + Weights
Accepted in ECCV 2022


This paper presents an audio-visual approach for voice separation that outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Both audio and motion features are then fed to an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and a comparison to state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are publicly available at https://ipcv.github.io/VoViT/
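As a rough illustration of the two-stage design described above, the sketch below mimics the data flow with plain numpy: a simplified graph convolution over face landmarks yields motion features, an audio-visual fusion step (a stand-in for the transformer) predicts a first separation mask, and an audio-only refinement enhances the predominant voice. All shapes, layer names, and weights here are illustrative placeholders, not the actual VoViT architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)

T, L, F = 50, 68, 256                        # frames, face landmarks, frequency bins
adj = np.eye(L)                              # placeholder landmark adjacency graph
landmarks = rng.standard_normal((T, L, 2))   # (x, y) coordinates per landmark, per frame

def graph_conv(x, adj, w):
    # One simplified graph-convolution layer: aggregate over the
    # landmark graph, then apply a learned projection.
    return np.tanh((adj @ x) @ w)

w_gcn = rng.standard_normal((2, 16))
motion = graph_conv(landmarks, adj, w_gcn).mean(axis=1)   # (T, 16) motion features

mixture = rng.standard_normal((T, F))        # magnitude spectrogram of the mixture

# Stage 1: audio-visual fusion (stand-in for the transformer) -> soft mask
w_fuse = rng.standard_normal((16, F))
mask1 = 1.0 / (1.0 + np.exp(-(motion @ w_fuse)))          # sigmoid mask in [0, 1]
estimate1 = mixture * mask1                  # first estimate of the target voice

# Stage 2: audio-only network refines the predominant voice
w_ref = 0.01 * rng.standard_normal((F, F))
mask2 = 1.0 / (1.0 + np.exp(-(estimate1 @ w_ref)))
estimate2 = estimate1 * mask2                # enhanced estimate, same shape as mixture

print(estimate2.shape)
```

The key point the sketch captures is that the visual branch is cheap (a small GCN on landmarks rather than full video frames), which is what enables the low-latency claim.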


@inproceedings{montesinos2022vovit,
  author    = {Montesinos, Juan F. and Kadandale, Venkatesh S. and Haro, Gloria},
  title     = {VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer},
  year      = {2022},
  isbn      = {978-3-031-19835-9},
  publisher = {Springer-Verlag},
  doi       = {10.1007/978-3-031-19836-6_18},
  booktitle = {Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII},
  pages     = {310–326},
}

Video presentation at ECCV 2022


The authors acknowledge support by MICINN/FEDER UE project, ref. PGC2018-098625-B-I00; PID2021-127643NB-I00 project; H2020-MSCA-RISE-2017 project, ref. 777826 NoMADS; ReAViPeRo network, ref. RED2018-102511-T; and the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the European Social Fund. J. F. M. acknowledges support by FPI scholarship PRE2018-083920. V. S. K. has received financial support through "la Caixa" Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064. V. S. K. has also received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. We gratefully acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments.