Synthesized speech: 1600 ms

Speaker 30: brip5p
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only
Speaker 30: bgbr5p
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only
Speaker 32: brbx4s
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only

In this example, the model of Morrone et al. [1] predicts "been red by l four soon", while our proposed model predicts "been red by x four soon". This illustrates the ambiguity between some visemes and their corresponding phonemes, as described in the limitations section of the paper.



Synthesized speech: 800 ms

Speaker 32: brbx4s
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only

In this example, we can observe how the audio-only baseline attempts to inpaint the corrupted segment with a sentence learned from the dataset.

Speaker 30: pwwb1n
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only


Synthesized speech: 400 ms

Speaker 32: lgii6a
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only


Synthesized speech: 160 ms

Speaker 34: swie4s
GT
Corrupted Input
Morrone et al. [1]
Ours, audio-visual
Ours, audio-only

VoxCeleb unseen-unheard: 1600 ms

GT
Corrupted Input
Ours, audio-visual
GT
Corrupted Input
Ours, audio-visual
GT
Corrupted Input
Ours, audio-visual