Speech Inpainting: Context-based Speech Synthesis Guided by Video
Paper under review
Audio and visual modalities are inherently connected in speech signals: lip movements and facial expressions are correlated with the production of speech sounds. This motivates studies that incorporate the visual modality to enhance an acoustic speech signal or even restore missing audio information. In this paper, we present a transformer-based deep learning model which produces state-of-the-art results in audio-visual speech inpainting. Given an audio-visual signal whose audio stream is partially corrupted, audio-visual speech inpainting is the task of synthesizing the audio of the corrupted segment coherently with the corresponding video and the uncorrupted audio. We compare the performance of our model against the previous state-of-the-art model and audio-only baselines, showing the importance of having an additional cue that provides information about the content of the corrupted audio. We also show how the visual features from AV-HuBERT are suitable for synthesizing speech.
This work was supported by FPI grant PRE2018-083920 and the MICINN/FEDER UE project PID2021-127643NB-I00.