ViCoFace: Learning Disentangled Latent Motion Representations for Visual-Consistent Face Reenactment

ACM Transactions on Multimedia Computing, Communications and Applications (TOMM), 2024
Jun Ling, Han Xue, Anni Tang, Rong Xie, Li Song
Shanghai Jiao Tong University

Update

  • [8/2024] !!! We added the inference code on GitHub.
  • [7/2024] !!! We added video results of the ablation study for better visualization, as well as video results of the Bi-layer model.
  • [11/2023] We uploaded the demo videos of ViCoFace.
  • TL;DR: We propose ViCoFace, a robust face reenactment framework that transfers facial motions while preserving the source portrait attributes. Our method outperforms previous motion transfer methods in both self-reenactment (i.e., same-identity reenactment) and cross-subject reenactment.


    Abstract

    Unsupervised face reenactment aims to animate a source image so that it imitates the motions of a target image while preserving the source portrait attributes (e.g., facial geometry/identity, hair texture, and background in the generated images). While prior methods can extract the motion from the target image via compact representations (e.g., key-points or motion bases), they struggle to preserve portrait attributes in cross-subject reenactment, because the predicted motion representations are often coupled with portrait attributes of the target image.

    In this work, we propose an effective and cost-efficient face reenactment approach to address this issue. Our approach has two major strengths. First, based on the theory of latent-motion bases, we decompose the full-head motion into two parts, the transferable motion and the preservable motion, and then compose the full motion representation from latent motions of both the source image and the target image. Second, to learn disentangled motions, we introduce an efficient training framework with two training strategies: 1) a mixture training strategy that combines self-reenactment training and cross-subject training for better motion disentanglement; and 2) a multi-path training strategy that improves the visual consistency of portrait attributes. Extensive experiments on widely used benchmarks demonstrate that our method exhibits remarkable generalization ability, e.g., better motion accuracy and stronger portrait attribute preservation, compared with state-of-the-art baselines.
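
    To make the linear latent-bases idea concrete, below is a minimal PyTorch sketch of how a motion code can be expressed as a linear combination of learned bases. This is our own illustrative reading, not the released implementation; the module name, the number of bases, and the dimensions are assumptions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class LatentMotionBases(nn.Module):
            """Illustrative latent-motion bases (a sketch, not the official code).

            A motion code is modeled as a linear combination of learned basis
            vectors, m = sum_k a_k * b_k, where the coefficients a come from
            the image encoder and the bases b are shared, trainable parameters.
            """

            def __init__(self, num_bases: int = 20, dim: int = 512):
                super().__init__()
                # Shared, trainable motion bases of shape (num_bases, dim);
                # the sizes here are placeholders, not the paper's settings.
                self.bases = nn.Parameter(torch.randn(num_bases, dim) * 0.02)

            def forward(self, coeffs: torch.Tensor) -> torch.Tensor:
                # coeffs: (batch, num_bases) -> motion code: (batch, dim).
                # Normalizing each basis vector keeps the combination well scaled.
                return coeffs @ F.normalize(self.bases, dim=-1)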

    Method Overview

    Generator


    Fig. 1. Illustration of our face reenactment framework. We incorporate two sets of latent bases to form a complete latent representation. The encoder E projects an image into transferable latent coefficients and preservable latent coefficients. The full latent motion is then obtained by linearly composing these coefficients with the latent bases. Finally, a generator G progressively synthesizes the output image from the encoder features and the composed latent motion.
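
    As a reading aid for Fig. 1, the following sketch shows how the composed motion might be assembled at inference time: the transferable coefficients are taken from the target (driving) image, the preservable coefficients from the source image, and both drive the generator together with the source encoder features. The function and module names (encoder, motion_bases, generator) and the returned tuple layout are our assumptions, not the released API.

        import torch

        @torch.no_grad()
        def reenact(encoder, motion_bases, generator, source_img, target_img):
            """Hypothetical inference path following Fig. 1 (a sketch only).

            Assumes the encoder returns (transferable coeffs, preservable coeffs,
            multi-scale appearance features) and that the two coefficient groups
            together index the full set of motion bases.
            """
            src_trans, src_pres, src_feats = encoder(source_img)
            tgt_trans, tgt_pres, _ = encoder(target_img)

            # Linear composition: take the transferable (pose/expression) part
            # from the target image, keep the preservable part from the source.
            motion = motion_bases(torch.cat([tgt_trans, src_pres], dim=-1))

            # The generator progressively synthesizes the output image from the
            # source appearance features, modulated by the composed motion code.
            return generator(src_feats, motion)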


    Training Framework


    Fig. 2. Proposed training framework. Differing from many preceding approaches that use only self-reenactment during training, our framework incorporates 1) a cross-subject training strategy that narrows the gap between training and inference, and 2) a multi-path reenactment strategy with a multi-path regularization loss that improves the consistency of visual attributes. For cross-subject training, we introduce four effective losses to stabilize the optimization.
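
    The training procedure in Fig. 2 can be summarized with a schematic loop. This is only a hedged outline: the loss names below are placeholders for the reconstruction losses, the four cross-subject losses, and the multi-path regularization, and the sampling details are our assumptions rather than the paper's exact recipe.

        import random

        def training_step(model, optimizer, src, drv, losses, cross_prob=0.5):
            """Schematic mixture-training step (an outline, not the official code).

            `losses` is a placeholder container for the reconstruction losses,
            the cross-subject losses, and the multi-path regularization term.
            """
            if random.random() < cross_prob:
                # Cross-subject path: driving frames come from other identities
                # (no pixel-level ground truth), so motion/identity/attribute
                # consistency losses are used instead of reconstruction.
                out = model(src, drv.roll(shifts=1, dims=0))
                loss = losses.cross_subject(out, src)
            else:
                # Self-reenactment path: source and driving frames share an
                # identity, so the driving frame is the reconstruction target.
                out = model(src, drv)
                loss = losses.reconstruction(out, drv)

            # Multi-path regularization: reenacting the same source with a second
            # driving frame should leave the source portrait attributes unchanged.
            out_alt = model(src, drv.flip(dims=(0,)))
            loss = loss + losses.multi_path_consistency(out, out_alt)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()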


    Video Results

    Ablation Study

    Motion Latent Visualization

    VoxCeleb-Test


    HDTF


    BibTeX

    @article{ling2023vicoface,
      author  = {Ling, Jun and Xue, Han and Tang, Anni and Xie, Rong and Song, Li},
      title   = {ViCoFace: Learning Disentangled Latent Motion Representations for Visual-Consistent Face Reenactment},
      journal = {ACM Transactions on Multimedia Computing, Communications and Applications},
      year    = {2024},
    }