Synthesized speech: 1600 ms
Speaker 30: brip5p
GT
![](../estimated_demo/0_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/0_mask100_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/0_mask100_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/0_mask100_cst/ours.png)
Ours, audio-only
![](../estimated_demo/0_mask100_cst/ao.png)
Speaker 30: bgbr5p
GT
![](../estimated_demo/1118_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/1118_mask100_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/1118_mask100_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/1118_mask100_cst/ours.png)
Ours, audio-only
![](../estimated_demo/1118_mask100_cst/ao.png)
Speaker 32: brbx4s
GT
![](../estimated_demo/1907_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/1907_mask100_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/1907_mask100_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/1907_mask100_cst/ours.png)
Ours, audio-only
![](../estimated_demo/1907_mask100_cst/ao.png)
In this example, the model of Morrone et al. [1] predicts "been red by l four soon", while our proposed model predicts "been red by x four soon". This illustrates the ambiguity between some visemes and their corresponding phonemes, as described in the limitations section of the paper.
Synthesized speech: 800 ms
Speaker 32: brbx4s
GT
![](../estimated_demo/1907_mask50_cst/gt.png)
Corrupted Input
![](../estimated_demo/1907_mask50_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/1907_mask50_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/1907_mask50_cst/ours.png)
Ours, audio-only
![](../estimated_demo/1907_mask50_cst/ao.png)
In this example, we can observe how the audio-only baseline attempts to inpaint the corrupted segment with a sentence learned from the dataset.
Speaker 30: pwwb1n
GT
![](../estimated_demo/1960_mask50_cst/gt.png)
Corrupted Input
![](../estimated_demo/1960_mask50_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/1960_mask50_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/1960_mask50_cst/ours.png)
Ours, audio-only
![](../estimated_demo/1960_mask50_cst/ao.png)
Synthesized speech: 400 ms
Speaker 32: lgii6a
GT
![](../estimated_demo/658_mask25_cst/gt.png)
Corrupted Input
![](../estimated_demo/658_mask25_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/658_mask25_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/658_mask25_cst/ours.png)
Ours, audio-only
![](../estimated_demo/658_mask25_cst/ao.png)
Synthesized speech: 160 ms
Speaker 34: swie4s
GT
![](../estimated_demo/2065_mask10_cst/gt.png)
Corrupted Input
![](../estimated_demo/2065_mask10_cst/input.png)
Morrone et al. [1]
![](../estimated_demo/2065_mask10_cst/morrone.png)
Ours, audio-visual
![](../estimated_demo/2065_mask10_cst/ours.png)
Ours, audio-only
![](../estimated_demo/2065_mask10_cst/ao.png)
VoxCeleb unseen-unheard: 1600 ms
GT
![](../estimated_demo/157_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/157_mask100_cst/input.png)
Ours, audio-visual
![](../estimated_demo/157_mask100_cst/ours.png)
GT
![](../estimated_demo/24_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/24_mask100_cst/input.png)
Ours, audio-visual
![](../estimated_demo/24_mask100_cst/ours.png)
GT
![](../estimated_demo/30_mask100_cst/gt.png)
Corrupted Input
![](../estimated_demo/30_mask100_cst/input.png)
Ours, audio-visual
![](../estimated_demo/30_mask100_cst/ours.png)