Synthesized speech: 1600 ms
Speaker 30: brip5p
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Speaker 30: bgbr5p
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Speaker 32: brbx4s
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

In this example we can see that Morrone's model predicts "been red by l four soon", while our proposed model predicts "been red by x four soon". This shows the ambiguity between some visemes and the corresponding phonemes described in the limitations section of the paper.
Synthesized speech: 800 ms
Speaker 32: brbx4s
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

In this example, we can observate how the audio-only baseline is attempting to inpaint the corrupted segment with a sentece learned from the dataset.
Speaker 30: pwwb1n
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Synthesized speech: 400 ms
Speaker 32: lgii6a
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Synthesized speech: 160 ms
Speaker 34: swie4s
GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Voxceleb unseen-unheard: 1600 ms
GT

Corrupted Input

Ours, audio-visual

GT

Corrupted Input

Ours, audio-visual

GT

Corrupted Input

Ours, audio-visual
