Synthesized speech: 1600 ms

Speaker 30: brip5p

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Speaker 30: bgbr5p

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Speaker 32: brbx4s

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

In this example, the model of Morrone et al. [1] predicts "been red by l four soon", while our proposed model predicts "been red by x four soon". This illustrates the ambiguity between some visemes and their corresponding phonemes, as described in the limitations section of the paper.
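To compare a prediction against the intended sentence, it can help to expand a sample identifier such as brbx4s into words. The sketch below assumes the identifiers follow the GRID corpus file-naming convention (command, color, preposition, letter, digit, adverb); this page does not state that explicitly, and the function name `decode_grid` is hypothetical. Note that the corpus word is "bin", which the transcriptions above render as "been".

```python
# Assumed GRID-style lookup tables, one per position of the six-character code.
COMMANDS = {"b": "bin", "l": "lay", "p": "place", "s": "set"}
COLORS = {"b": "blue", "g": "green", "r": "red", "w": "white"}
PREPOSITIONS = {"a": "at", "b": "by", "i": "in", "w": "with"}
DIGITS = {
    "z": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
ADVERBS = {"a": "again", "n": "now", "p": "please", "s": "soon"}


def decode_grid(code: str) -> str:
    """Expand a six-character utterance code into its sentence (hypothetical helper)."""
    if len(code) != 6:
        raise ValueError(f"expected a six-character code, got {code!r}")
    command, color, preposition, letter, digit, adverb = code.lower()
    words = [
        COMMANDS[command],
        COLORS[color],
        PREPOSITIONS[preposition],
        letter,  # the spoken letter is the character itself
        DIGITS[digit],
        ADVERBS[adverb],
    ]
    return " ".join(words)


print(decode_grid("brbx4s"))  # bin red by x four soon
```

Under this assumption, the fourth character of brbx4s is the spoken letter, so the reference sentence contains "x" rather than "l".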

Synthesized speech: 800 ms

Speaker 32: brbx4s

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

In this example, we can observe how the audio-only baseline attempts to inpaint the corrupted segment with a sentence learned from the dataset.

Speaker 30: pwwb1n

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Synthesized speech: 400 ms

Speaker 32: lgii6a

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

Synthesized speech: 160 ms

Speaker 34: swie4s

GT

Corrupted Input

Morrone et al. [1]

Ours, audio-visual

Ours, audio-only

VoxCeleb unseen-unheard: 1600 ms

GT

Corrupted Input

Ours, audio-visual

GT

Corrupted Input

Ours, audio-visual

GT

Corrupted Input

Ours, audio-visual