VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Code + Weights
Accepted at ECCV 2022


This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights will be made publicly available at https://ipcv.github.io/VoViT/
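To make the idea of extracting motion cues from face landmarks with a graph convolution more concrete, here is a minimal sketch in plain Python. It is not the network used in the paper: it applies a single generic graph-convolution layer with a symmetrically normalised adjacency matrix, and the landmark graph, coordinates, weights, and function names (`normalized_adjacency`, `graph_conv`) are all hypothetical, chosen only for illustration.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a nested list (self-loops included)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def graph_conv(features, adj_norm, weights):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    mixed = matmul(adj_norm, features)   # average each node with its neighbours
    out = matmul(mixed, weights)         # project to the output feature size
    return [[max(0.0, v) for v in row] for row in out]


# Toy example (assumed data): 4 landmarks connected in a chain, (x, y) each.
edges = [(0, 1), (1, 2), (2, 3)]
landmarks = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
weights = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]  # maps 2 inputs to 3 features

adj = normalized_adjacency(4, edges)
motion_features = graph_conv(landmarks, adj, weights)
```

In this sketch each landmark's output feature mixes its own coordinates with those of its neighbours in the graph, which is the basic mechanism that lets a graph network work directly on landmarks instead of raw video frames.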


@inproceedings{montesinos2022vovit,
  title={VoViT: Low Latency Graph-Based Audio-Visual Voice Separation Transformer},
  author={Montesinos, Juan F. and Kadandale, Venkatesh S. and Haro, Gloria},
  booktitle={Arxiv preprint arXiv:2203.04099},
  year={2022}
}


The authors acknowledge support by MICINN/FEDER UE project, ref. PGC2018-098625-B-I00; H2020-MSCA-RISE-2017 project, ref. 777826 NoMADS; ReAViPeRo network, ref. RED2018-102511-T; and Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the Social European Funds. J. F. M. acknowledges support by FPI scholarship PRE2018-083920. V. S. K. has received financial support through “la Caixa” Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064. V. S. K. has also received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. We gratefully acknowledge NVIDIA Corporation for the donation of GPUs used for the experiments.