Figure: Different dependency modeling modules. We test autoregressive, LSTM-, GRU-, and Transformer-based methods.
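For context, the sketch below shows what a minimal Transformer-based inter-frame dependency module of this kind might look like in PyTorch. The layer sizes, the simple concatenation-based audio fusion, and the module name are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AudioFusedTransformer(nn.Module):
    """Illustrative inter-frame dependency module: fuses per-frame audio and
    visual features, then lets a Transformer encoder attend across frames.
    All dimensions are placeholder assumptions, not the paper's settings."""
    def __init__(self, audio_dim=256, visual_dim=256, model_dim=512, num_layers=4):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + visual_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, frames, audio_dim); visual_feats: (batch, frames, visual_dim)
        x = self.fuse(torch.cat([audio_feats, visual_feats], dim=-1))
        # Self-attention lets every frame condition on all other frames,
        # in contrast to purely autoregressive or recurrent (LSTM/GRU) modules.
        return self.encoder(x)  # (batch, frames, model_dim)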
While previous methods for speech-driven talking face generation have made significant advances in the visual and lip-sync quality of synthesized videos, they have paid less attention to lip motion jitters, which can substantially undermine the perceived quality of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct a systematic analysis of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and the output video, and we introduce several effective designs to improve motion stability. Our study finds that several factors can lead to jitters in the synthesized talking face video: jitters in the input face representations, training-inference mismatch, and a lack of dependency modeling in the generation network.
Accordingly, we propose three effective solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the neural renderer's input during training to simulate inference-time distortion and reduce the train-test mismatch; 3) an audio-fused transformer generator that models inter-frame dependency. In addition, since there is no off-the-shelf metric that measures motion jitters in talking face video, we devise an objective metric, the Motion Stability Index (MSI), to quantify them. Extensive experimental results show that the proposed method generates motion-stable talking face videos with higher quality than previous systems.
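As a rough illustration of the first solution, the snippet below applies Gaussian smoothing along the time axis of a sequence of 3D face coefficients. A fixed sigma is used here for simplicity; the paper's module adapts the smoothing strength, and the exact adaptation rule is not reproduced.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_face_coeffs(coeffs, sigma=1.0):
    """Smooth a (frames, dims) array of 3D face coefficients along time.

    A fixed-sigma stand-in for the paper's Gaussian-based *adaptive*
    smoothing module, which adjusts the smoothing strength per sequence.
    """
    return gaussian_filter1d(coeffs.astype(np.float64), sigma=sigma, axis=0)

# Example: 100 frames of 64-dim expression coefficients with additive noise.
noisy = np.random.randn(100, 64)
smoothed = smooth_face_coeffs(noisy, sigma=2.0)

Along the same lines, a simple jitter score can be computed from second-order temporal differences (frame-to-frame acceleration) of the trajectories. This is only an illustrative stand-in, not the paper's MSI definition.

def jitter_proxy(coeffs):
    """Mean squared second-order temporal difference; larger means more jitter.
    An illustrative proxy only, not the paper's Motion Stability Index."""
    accel = np.diff(coeffs, n=2, axis=0)
    return float(np.mean(accel ** 2))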
To visualize lip movements, we concatenate the vertical slice (marked by a red/green/blue line) from each frame and show the concatenated result at the bottom.
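A minimal version of this visualization, assuming the frames are loaded as equally sized NumPy RGB arrays, might look like:

import numpy as np

def slice_strip(frames, col):
    """Concatenate the vertical slice at pixel column `col` from each frame.

    frames: list of (H, W, 3) uint8 arrays. Returns an (H, num_frames, 3)
    image whose horizontal axis is time.
    """
    return np.stack([f[:, col, :] for f in frames], axis=1)

In the resulting strip, smooth lip motion appears as smooth horizontal bands, while jitter shows up as high-frequency vertical discontinuities.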
@article{ling2022stableface,
  title={StableFace: Analyzing and Improving Motion Stability for Talking Face Generation},
  author={Ling, Jun and Tan, Xu and Chen, Liyang and Li, Runnan and Zhang, Yuchao and Zhao, Sheng and Song, Li},
  journal={arXiv preprint arXiv:2208.13717},
  year={2022}
}