MuteSwap: Silent Face-based Voice Conversion


Abstract

Conventional voice conversion relies on audio input to modify voice characteristics from a source speaker to a target speaker. However, this becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. To overcome this limitation, we introduce Silent Face-based Voice Conversion (SFVC), the first approach to achieve speech conversion exclusively from visual inputs. Given only target speaker images and a silent lip video, SFVC reconstructs speech that preserves the spoken content while matching the target speaker's identity, without any audio input. Because this task requires both generating intelligible speech and converting speaker identity from visual cues alone, it is particularly challenging.


To address this, we introduce MuteSwap, the first approach for SFVC: a novel single-stage framework that employs contrastive learning for cross-modality identity alignment and mutual information minimization to disentangle shared visual features. Objective and subjective experimental results show that MuteSwap performs strongly in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC. Code and parameters will be made publicly available upon acceptance of this paper.
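As a rough illustration of the cross-modality identity alignment mentioned above, the sketch below implements a symmetric InfoNCE-style contrastive loss that pulls a speaker's face embedding toward the embedding of that speaker's voice within a batch. The function name, embedding dimensions, and temperature are illustrative assumptions, not the released MuteSwap implementation.

```python
# Hypothetical sketch of a cross-modal contrastive identity alignment loss.
# All names and shapes are illustrative assumptions, not the MuteSwap API.
import torch
import torch.nn.functional as F

def contrastive_identity_loss(face_emb: torch.Tensor,
                              voice_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired (face, voice) embeddings of shape (batch, dim)."""
    face = F.normalize(face_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = face @ voice.t() / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(face.size(0), device=face.device)
    # Matching face/voice pairs lie on the diagonal of the similarity matrix.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
face_emb = torch.randn(8, 256)
voice_emb = torch.randn(8, 256)
loss = contrastive_identity_loss(face_emb, voice_emb)
```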



Comparison with FVC

MuteSwap (Identity Converted) takes its image source from the reference video, while MuteSwap (Vanilla Synthesis) takes its image source from the source video itself.

Samples 1–4
Inputs: Source Video and Source Video (Noised) for FVC; Silent Lip and Image Source for SFVC; plus the Reference Video.
Outputs: FVMVC and SP-FaceVC for FVC; MuteSwap (Identity Converted) and MuteSwap (Vanilla Synthesis) for SFVC.

Converted Examples on LRS3

The model learns to synthesize speech from silent lip videos conditioned on facial images.
Homogeneous and Divergent Image Sources denote facial images sampled from:
(1) different videos of the same speaker (Homogeneous);
(2) videos of different speakers (Divergent).
For Vanilla Synthesis, the facial images are sampled from the same full-face video as the silent lip video.

Four examples, each presenting a Silent Lip video together with Homogeneous and Divergent Image Sources, and the corresponding Synthesized Speeches (Vanilla Synthesis and Identity Converted).

Converted Examples on an Unseen Dataset, VoxCeleb2

Converted utterances on VoxCeleb2 using MuteSwap trained on the LRS3 top-200 dataset.

Four examples, each presenting a Silent Lip video together with three Image Sources, and the corresponding synthesized speech (Vanilla Synthesis and Identity Converted).

Voice interpolation

Reference speakers: Speaker A, Speaker B, Speaker C, Speaker D, Speaker E, Speaker F
Interpolations between the pairs (A, B), (C, D), and (E, F):
A            C            E
0.4A+0.6B    0.4C+0.6D    0.4E+0.6F
0.8A+0.2B    0.8C+0.2D    0.8E+0.2F
B            D            F
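The mixing weights above (e.g. 0.4A+0.6B) suggest that interpolated voices are obtained by linearly combining two speakers' identity representations before synthesis. The sketch below illustrates that idea; the `interpolate_identity` helper, the embedding size, and the assumption that the mixed embedding directly conditions the synthesizer are illustrative, not the released MuteSwap interface.

```python
# Minimal sketch (assumed, not the MuteSwap API): voice interpolation as a
# convex combination of two speakers' identity embeddings.
import torch

def interpolate_identity(emb_a: torch.Tensor,
                         emb_b: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    """Return alpha * emb_a + (1 - alpha) * emb_b, e.g. alpha=0.4 gives 0.4A+0.6B."""
    return alpha * emb_a + (1.0 - alpha) * emb_b

# Placeholder identity embeddings standing in for face-encoder outputs.
emb_a = torch.randn(256)  # speaker A (hypothetical 256-dim embedding)
emb_b = torch.randn(256)  # speaker B
mixed = interpolate_identity(emb_a, emb_b, alpha=0.4)  # "0.4A + 0.6B"
# The mixed embedding would then condition the decoder in place of a single
# speaker's embedding.
```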