Unsupervised face reenactment aims to animate a source image to imitate the motions of a target image while preserving the source portrait attributes (e.g., facial geometry/identity, hair texture, and background) in the generated images. While prior methods can extract motion from the target image via compact representations (e.g., keypoints or motion bases), they struggle to preserve portrait attributes in cross-subject reenactment, because the predicted motion representations are often entangled with portrait attributes of the target image.
In this work, we propose an effective and cost-efficient face reenactment approach to address this issue. Our approach has two major strengths. First, building on the theory of latent-motion bases, we decompose full-head motion into two parts, transferable motion and preservable motion, and compose the full motion representation from latent motions of both the source image and the target image. Second, to learn disentangled motions, we introduce an efficient training framework that features two training strategies: 1) a mixture training strategy that combines self-reenactment training and cross-subject training for better motion disentanglement; and 2) a multi-path training strategy that improves the visual consistency of portrait attributes. Extensive experiments on widely used benchmarks demonstrate that our method exhibits remarkable generalization ability, e.g., better motion accuracy and portrait-attribute preservation, compared with state-of-the-art baselines.
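To make the composition step concrete, one plausible reading of the linear composition (in our own notation, not the paper's) is:

```latex
% Hedged sketch, notation ours: {d^t_i} and {d^p_j} denote the transferable and
% preservable latent-motion bases, and alpha(.), beta(.) are the coefficient
% vectors predicted by the encoder for a given image.
z_{\text{comp}}
  = \underbrace{\sum_{i} \alpha_i(I_{\text{tgt}})\, d^{t}_{i}}_{\text{motion from target}}
  + \underbrace{\sum_{j} \beta_j(I_{\text{src}})\, d^{p}_{j}}_{\text{attributes from source}}
```

Here the transferable coefficients come from the target image and the preservable coefficients come from the source image, so target-specific appearance is kept out of the composed motion.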
Fig. 1. Illustration of our face reenactment framework. We incorporate two latent bases to form a complete latent representation. The encoder E projects an image into transferable latent coefficients and preservable latent coefficients. The latent motions are then composed via linear combination, and a generator G progressively synthesizes the final image from the encoder features and the composed latent motions.
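As a rough illustration of the pipeline in Fig. 1, the following PyTorch-style sketch shows how composed latent motions could be formed and decoded. Module names, tensor shapes, the number of bases, and the encoder's output signature are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LatentComposition(nn.Module):
    """Composes a latent motion from two learnable latent-motion bases (sketch)."""
    def __init__(self, num_transfer=20, num_preserve=10, dim=512):
        super().__init__()
        # Two latent bases: one transferable (motion), one preservable (attributes).
        self.transfer_bases = nn.Parameter(torch.randn(num_transfer, dim))
        self.preserve_bases = nn.Parameter(torch.randn(num_preserve, dim))

    def forward(self, alpha_tgt, beta_src):
        # alpha_tgt: (B, num_transfer) coefficients predicted from the target image
        # beta_src:  (B, num_preserve) coefficients predicted from the source image
        motion = alpha_tgt @ self.transfer_bases + beta_src @ self.preserve_bases
        return motion  # (B, dim) composed latent motion

def reenact(encoder, generator, composer, src_img, tgt_img):
    # Encoder E is assumed to return appearance features plus both coefficient sets.
    src_feats, _, beta_src = encoder(src_img)
    _, alpha_tgt, _ = encoder(tgt_img)
    # Transferable part from the target, preservable part from the source.
    motion = composer(alpha_tgt, beta_src)
    # Generator G synthesizes the reenacted image from source features and composed motion.
    return generator(src_feats, motion)
```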
Fig. 2. Proposed training framework. Unlike many prior approaches that use only self-reenactment during training, our framework incorporates 1) a cross-subject training strategy to narrow the gap between training and inference, and 2) a multi-path reenactment strategy with a multi-path regularization loss to improve the consistency of visual attributes. For cross-subject training, we introduce four effective losses to stabilize optimization.
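The sketch below, reusing the hypothetical `reenact` helper above, shows how the mixture (self-reenactment plus cross-subject) and multi-path strategies might be combined in one training step. All batch keys, loss names, and weights are placeholders; the paper's four cross-subject losses and the multi-path regularization are represented only as generic callables.

```python
def training_step(batch, encoder, generator, composer, losses, w):
    # Batch fields, loss names, and weights are placeholders, not the authors' naming.
    src = batch["source"]
    drv_same = batch["driving_same_id"]    # same identity as src (self-reenactment)
    drv_cross = batch["driving_cross_id"]  # different identity (cross-subject)

    # 1) Self-reenactment path: a ground-truth frame exists, so the output is
    #    supervised directly with a reconstruction loss.
    out_self = reenact(encoder, generator, composer, src, drv_same)
    loss = w["rec"] * losses["reconstruction"](out_self, drv_same)

    # 2) Cross-subject path: no pixel-level ground truth; auxiliary losses
    #    (four in the paper, summarized as one callable here) stabilize optimization.
    out_cross = reenact(encoder, generator, composer, src, drv_cross)
    loss = loss + w["cross"] * losses["cross_subject"](out_cross, src, drv_cross)

    # 3) Multi-path regularization: reenact the same source along another driving
    #    path and penalize inconsistency of the preserved portrait attributes.
    out_alt = reenact(encoder, generator, composer, src, batch["driving_alt"])
    loss = loss + w["multi"] * losses["multi_path_consistency"](out_cross, out_alt, src)
    return loss
```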
@article{ling2023vicoface,
author = {Ling, Jun and Xue, Han and Tang, Anni and Xie, Rong and Song, Li},
title = {ViCoFace: Learning Disentangled Latent Motion Representations for Visual-Consistent Face Reenactment},
journal = {arXiv preprint},
year = {2023},
}