MuteSwap: Silent Face-based Voice Conversion
Abstract
Conventional voice conversion relies on audio input to modify voice characteristics from a source speaker to a target speaker. However, this becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. To overcome this limitation, we introduce Silent Face-based Voice Conversion (SFVC), a novel task that performs voice conversion exclusively from visual inputs. Given only target-speaker face images and a silent lip video, SFVC reconstructs speech that preserves the spoken content while adopting the target speaker’s voice identity, without any audio input. Because the task requires generating intelligible speech and converting identity from visual cues alone, it is particularly challenging.
To address this, we propose MuteSwap, the first approach to SFVC: a single-stage framework that employs contrastive learning for cross-modality identity alignment and mutual information minimization to disentangle shared visual features. Both objective and subjective results show that MuteSwap performs strongly on speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC. Code and parameters will be made publicly available upon acceptance of this paper.
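For readers who want something concrete, the sketch below shows one standard way to instantiate the two objectives named above: an InfoNCE-style contrastive loss for cross-modality identity alignment and a CLUB-style upper bound for mutual information minimization. It is an illustration under stated assumptions, not the MuteSwap implementation; the tensor names (`face_id`, `voice_id`, `content`) and the auxiliary `cond_net` are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_identity_loss(face_id, voice_id, temperature=0.07):
    """InfoNCE-style loss aligning face and voice identity embeddings.

    face_id, voice_id: (batch, dim) embeddings paired row-by-row;
    the other rows in the batch serve as negatives.
    """
    face_id = F.normalize(face_id, dim=-1)
    voice_id = F.normalize(voice_id, dim=-1)
    logits = face_id @ voice_id.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(face_id.size(0), device=face_id.device)
    # Symmetric cross-entropy: each face must match its own voice, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def mi_upper_bound(content, face_id, cond_net):
    """CLUB-style upper bound on I(content; identity), to be minimized.

    cond_net is an auxiliary network, trained separately by maximum
    likelihood, that predicts the mean of q(face_id | content) under a
    unit-variance Gaussian.
    """
    mu = cond_net(content)  # (batch, dim) predicted identity embedding
    matched = -((mu - face_id) ** 2).mean()                            # paired samples
    shuffled = -((mu.unsqueeze(1) - face_id.unsqueeze(0)) ** 2).mean() # all pairs
    return matched - shuffled  # CLUB estimate of the mutual-information bound
```

Minimizing `mi_upper_bound` with respect to the content encoder pushes identity information out of the lip-derived content features, while `contrastive_identity_loss` ties the face-derived identity embedding to the voice space.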
Comparison with Face-based Voice Conversion (FVC)
MuteSwap (Identity Converted) takes its image source from the reference video, whereas MuteSwap (Vanilla Synthesis) takes its image source from the source video itself.
Sample 1
Inputs:

| Source Video (FVC) | Source Video, Noised (FVC) | Silent Lip (SFVC) | Image Source (SFVC) | Reference Video |
|---|---|---|---|---|
| (video) | (video) | (video) | (images) | (video) |

Outputs:

| FVMVC (FVC) | SP-FaceVC (FVC) | MuteSwap, Identity Converted (SFVC) | MuteSwap, Vanilla Synthesis (SFVC) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |
Sample 2
Inputs:

| Source Video (FVC) | Source Video, Noised (FVC) | Silent Lip (SFVC) | Image Source (SFVC) | Reference Video |
|---|---|---|---|---|
| (video) | (video) | (video) | (images) | (video) |

Outputs:

| FVMVC (FVC) | SP-FaceVC (FVC) | MuteSwap, Identity Converted (SFVC) | MuteSwap, Vanilla Synthesis (SFVC) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |
Sample 3
Inputs:

| Source Video (FVC) | Source Video, Noised (FVC) | Silent Lip (SFVC) | Image Source (SFVC) | Reference Video |
|---|---|---|---|---|
| (video) | (video) | (video) | (images) | (video) |

Outputs:

| FVMVC (FVC) | SP-FaceVC (FVC) | MuteSwap, Identity Converted (SFVC) | MuteSwap, Vanilla Synthesis (SFVC) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |
Sample 4
Inputs:

| Source Video (FVC) | Source Video, Noised (FVC) | Silent Lip (SFVC) | Image Source (SFVC) | Reference Video |
|---|---|---|---|---|
| (video) | (video) | (video) | (images) | (video) |

Outputs:

| FVMVC (FVC) | SP-FaceVC (FVC) | MuteSwap, Identity Converted (SFVC) | MuteSwap, Vanilla Synthesis (SFVC) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |
Converted Examples on LRS3
The model learns to synthesize speech from silent lip videos conditioned on facial images.
Homogeneous and Divergent Image Sources denote facial images sampled from:
(1) different videos of the same speaker (Homogeneous);
(2) videos of different speakers (Divergent).
For vanilla synthesis, the facial images are sampled from the same full-face video that corresponds to the silent lip; a sampling sketch follows below.
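A minimal sketch of the three image-sampling strategies, assuming a hypothetical corpus index that maps each speaker to their video clips (none of these helpers come from the paper):

```python
import random

def sample_image_source(corpus, speaker, lip_video, mode):
    """Pick the video whose face frames will condition the synthesis.

    corpus: dict mapping speaker id -> list of that speaker's video clips.
    """
    if mode == "vanilla":
        # Same full-face video that the silent lip was cropped from.
        return lip_video
    if mode == "homogeneous":
        # A different video of the same speaker.
        others = [v for v in corpus[speaker] if v != lip_video]
        return random.choice(others)
    if mode == "divergent":
        # A video of a different speaker.
        other = random.choice([s for s in corpus if s != speaker])
        return random.choice(corpus[other])
    raise ValueError(f"unknown mode: {mode}")
```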
Example 1

| Silent Lip | Homogeneous Image Sources | Divergent Image Sources |
|---|---|---|
| (video) | (images) | (images) |

Synthesized Speeches:

| Vanilla Synthesis | Identity Converted (Homogeneous) | Identity Converted (Divergent) |
|---|---|---|
| (audio) | (audio) | (audio) |

Example 2

| Silent Lip | Homogeneous Image Sources | Divergent Image Sources |
|---|---|---|
| (video) | (images) | (images) |

Synthesized Speeches:

| Vanilla Synthesis | Identity Converted (Homogeneous) | Identity Converted (Divergent) |
|---|---|---|
| (audio) | (audio) | (audio) |

Example 3

| Silent Lip | Homogeneous Image Sources | Divergent Image Sources |
|---|---|---|
| (video) | (images) | (images) |

Synthesized Speeches:

| Vanilla Synthesis | Identity Converted (Homogeneous) | Identity Converted (Divergent) |
|---|---|---|
| (audio) | (audio) | (audio) |

Example 4

| Silent Lip | Homogeneous Image Sources | Divergent Image Sources |
|---|---|---|
| (video) | (images) | (images) |

Synthesized Speeches:

| Vanilla Synthesis | Identity Converted (Homogeneous) | Identity Converted (Divergent) |
|---|---|---|
| (audio) | (audio) | (audio) |
Converted Examples on the Unseen VoxCeleb2 Dataset
Converted utterances on VoxCeleb2, produced by MuteSwap trained on the LRS3 top200 dataset.
Example 1

| Silent Lip | Image Source 1 | Image Source 2 | Image Source 3 |
|---|---|---|---|
| (video) | (images) | (images) | (images) |

| Vanilla Synthesis | Identity Converted (Source 1) | Identity Converted (Source 2) | Identity Converted (Source 3) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |

Example 2

| Silent Lip | Image Source 1 | Image Source 2 | Image Source 3 |
|---|---|---|---|
| (video) | (images) | (images) | (images) |

| Vanilla Synthesis | Identity Converted (Source 1) | Identity Converted (Source 2) | Identity Converted (Source 3) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |

Example 3

| Silent Lip | Image Source 1 | Image Source 2 | Image Source 3 |
|---|---|---|---|
| (video) | (images) | (images) | (images) |

| Vanilla Synthesis | Identity Converted (Source 1) | Identity Converted (Source 2) | Identity Converted (Source 3) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |

Example 4

| Silent Lip | Image Source 1 | Image Source 2 | Image Source 3 |
|---|---|---|---|
| (video) | (images) | (images) | (images) |

| Vanilla Synthesis | Identity Converted (Source 1) | Identity Converted (Source 2) | Identity Converted (Source 3) |
|---|---|---|---|
| (audio) | (audio) | (audio) | (audio) |
Voice Interpolation
Each label such as 0.8A+0.2B denotes a weighted blend of the two speakers in a pair; within each column, the rows run from one pure speaker to the other.

| | Speakers A & B | Speakers C & D | Speakers E & F |
|---|---|---|---|
| Face images | (images) | (images) | (images) |
| Pure first speaker | A (audio) | C (audio) | E (audio) |
| 0.8 / 0.2 blend | 0.8A+0.2B (audio) | 0.8C+0.2D (audio) | 0.8E+0.2F (audio) |
| 0.4 / 0.6 blend | 0.4A+0.6B (audio) | 0.4C+0.6D (audio) | 0.4E+0.6F (audio) |
| Pure second speaker | B (audio) | D (audio) | F (audio) |
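The interpolation is presumably a convex combination of identity embeddings before decoding. Below is a minimal sketch under that assumption; `z_a`, `z_b`, and the commented decoder call are hypothetical, not the MuteSwap API.

```python
import torch

# Placeholder identity embeddings for two speakers (hypothetical shapes).
z_a = torch.randn(256)
z_b = torch.randn(256)

def interpolate_identity(z_a, z_b, alpha):
    """Convex blend of two identity embeddings; alpha=1.0 is pure speaker A."""
    return alpha * z_a + (1.0 - alpha) * z_b

# Weights matching the table above: A, 0.8A+0.2B, 0.4A+0.6B, B.
for alpha in (1.0, 0.8, 0.4, 0.0):
    z_mix = interpolate_identity(z_a, z_b, alpha)
    # speech = decoder(lip_features, z_mix)  # hypothetical decoder call
```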