MuteSwap: Silent Face-based Voice Conversion


Abstract

Conventional voice conversion relies on audio input to modify voice characteristics from a source speaker to a target speaker. However, this becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. To overcome this limitation, we introduce Silent Face-based Voice Conversion (SFVC), the first approach to achieve speech conversion exclusively from visual inputs. Given only target speaker images and a silent lip video, SFVC reconstructs speech that preserves the spoken content while matching the target speaker's identity, without any audio input. Because this task requires both generating intelligible speech and converting speaker identity from visual cues alone, it is particularly challenging.


To address this, we introduce MuteSwap, the first approach for SFVC: a novel single-stage framework that employs contrastive learning for cross-modality identity alignment and mutual information minimization to disentangle shared visual features. Objective and subjective experimental results show that MuteSwap performs strongly in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC. Code and parameters will be made publicly available upon acceptance of this paper.
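As a rough illustration of the cross-modality identity alignment mentioned above, the sketch below implements a symmetric InfoNCE-style contrastive loss that pulls a speaker's face embedding toward the embedding of that speaker's voice within a batch. The function name, embedding dimensions, and temperature are illustrative assumptions, not the released MuteSwap implementation.

```python
# Hypothetical sketch of a cross-modal contrastive identity alignment loss.
# All names and shapes are illustrative assumptions, not the MuteSwap API.
import torch
import torch.nn.functional as F

def contrastive_identity_loss(face_emb: torch.Tensor,
                              voice_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over paired (face, voice) embeddings of shape (batch, dim)."""
    face = F.normalize(face_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = face @ voice.t() / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(face.size(0), device=face.device)
    # Matching face/voice pairs lie on the diagonal of the similarity matrix.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
face_emb = torch.randn(8, 256)
voice_emb = torch.randn(8, 256)
loss = contrastive_identity_loss(face_emb, voice_emb)
```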



Comparison with FVC

MuteSwap (Identity Converted) takes its image source from the reference video, while MuteSwap (Vanilla Synthesis) takes its image source from the source video itself.

Samples 1–4
Inputs: Source Video and Source Video (Noised) for FVC; Silent Lip and Image Source for SFVC; plus the Reference Video.
Outputs: FVMVC and SP-FaceVC for FVC; MuteSwap (Identity Converted) and MuteSwap (Vanilla Synthesis) for SFVC.

Converted Examples on LRS3

The model learns to synthesize speech from silent lip videos conditioned on facial images.
Homogeneous and Divergent Image Sources denote facial images sampled from:
(1) different videos of the same speaker (Homogeneous);
(2) videos of different speakers (Divergent).
For Vanilla Synthesis, the facial images are sampled from the same full-face video as the silent lip video.

Four examples, each presenting a Silent Lip video together with Homogeneous and Divergent Image Sources, and the corresponding Synthesized Speeches (Vanilla Synthesis and Identity Converted).

Converted Examples on an Unseen Dataset, VoxCeleb2

Converted utterances on VoxCeleb2 using MuteSwap trained on the LRS3 top-200 dataset.

Four examples, each presenting a Silent Lip video together with three Image Sources, and the corresponding synthesized speech (Vanilla Synthesis and Identity Converted).

Voice interpolation

Reference speakers: Speaker A, Speaker B, Speaker C, Speaker D, Speaker E, Speaker F
Interpolations between the pairs (A, B), (C, D), and (E, F):
A            C            E
0.4A+0.6B    0.4C+0.6D    0.4E+0.6F
0.8A+0.2B    0.8C+0.2D    0.8E+0.2F
B            D            F
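The mixing weights above (e.g. 0.4A+0.6B) suggest that interpolated voices are obtained by linearly combining two speakers' identity representations before synthesis. The sketch below illustrates that idea; the `interpolate_identity` helper, the embedding size, and the assumption that the mixed embedding directly conditions the synthesizer are illustrative, not the released MuteSwap interface.

```python
# Minimal sketch (assumed, not the MuteSwap API): voice interpolation as a
# convex combination of two speakers' identity embeddings.
import torch

def interpolate_identity(emb_a: torch.Tensor,
                         emb_b: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    """Return alpha * emb_a + (1 - alpha) * emb_b, e.g. alpha=0.4 gives 0.4A+0.6B."""
    return alpha * emb_a + (1.0 - alpha) * emb_b

# Placeholder identity embeddings standing in for face-encoder outputs.
emb_a = torch.randn(256)  # speaker A (hypothetical 256-dim embedding)
emb_b = torch.randn(256)  # speaker B
mixed = interpolate_identity(emb_a, emb_b, alpha=0.4)  # "0.4A + 0.6B"
# The mixed embedding would then condition the decoder in place of a single
# speaker's embedding.
```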