CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits

Tongyi Lab, Alibaba Group

Abstract

With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received comparatively little attention. Compared to 3D digital humans, which require complex modeling and incur high rendering costs, and 2D video-based solutions, which lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach enables the rapid generation, in under half a minute, of a highly expressive and visually accurate Live2D model that closely resembles the input portrait. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation.

Method


Overview of the CartoonAlive pipeline. (a) Facial Feature Alignment: The input portrait is first preprocessed to align the eyes horizontally, ensuring consistent orientation. Then, facial keypoints for the eyes, nose, mouth, eyebrows, and facial contour are individually detected. A transformation is computed between each set of detected keypoints and those of a predefined template model. Based on this correspondence, each facial component in the input image is aligned accordingly. (b) Facial Feature Parameter Estimation: Facial features are temporarily removed from the texture, and rendering is performed using only the underlying face image. Keypoints are then extracted from the rendered image, and corresponding Live2D parameters (e.g., position and scale) are inferred through a trained neural network. (c) Underlying Face Repainting: To eliminate visual artifacts caused by overlapping facial features during animation, the underlying face image is repainted according to a mask derived from the inferred parameters, effectively removing foreground features that may interfere with dynamic expressions. (d) Hair Texture Extraction: Hair segmentation is applied to isolate the hair region from the original image, which is then transferred into the final Live2D model as a separate texture layer. This ensures realistic integration of hair while preserving the integrity of facial components.
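The sketch below illustrates the per-component alignment in step (a): a similarity transform (scale, rotation, translation) is estimated between a component's detected keypoints and the corresponding template keypoints, and the detected points are warped into the template frame. The use of a Umeyama-style least-squares fit and the array shapes are illustrative assumptions; the report does not specify the exact estimator.

import numpy as np

def estimate_similarity(src: np.ndarray, dst: np.ndarray):
    """Return scale s, rotation R (2x2), translation t with s*R@src_i + t ~= dst_i.

    src, dst: (N, 2) arrays of corresponding 2D keypoints.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)               # cross-covariance of the point sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))             # guard against reflections
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

def align_component(points: np.ndarray, template_points: np.ndarray) -> np.ndarray:
    """Warp detected keypoints of one facial component into the template's frame."""
    s, R, t = estimate_similarity(points, template_points)
    return (s * (R @ points.T)).T + t

The same routine can be applied independently to the eyes, nose, mouth, eyebrows, and facial contour, each against its own template keypoint set.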

Features

Live2D Blendshape Design

We redesign the structure of Live2D models to support linear control of facial components along three axes: horizontal (x), vertical (y), and scaling (scale), with parameter ranges spanning from -30 to 30. This enables the creation of a diverse range of facial expressions and identities. Additionally, we modify and expand the base face template to accommodate various facial types, including long, round, and broad faces.
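As a minimal sketch of this parameterization, each facial component can be driven by three linear parameters (x, y, scale), each clamped to [-30, 30]. The mapping from parameter units to pixel offsets and scale factors below is an assumption for illustration, not the report's exact calibration.

from dataclasses import dataclass

PARAM_MIN, PARAM_MAX = -30.0, 30.0

def clamp(v: float) -> float:
    return max(PARAM_MIN, min(PARAM_MAX, v))

@dataclass
class ComponentParams:
    x: float = 0.0      # horizontal offset parameter
    y: float = 0.0      # vertical offset parameter
    scale: float = 0.0  # scaling parameter (0 = template size)

    def apply(self, cx: float, cy: float, size: float,
              px_per_unit: float = 2.0, scale_per_unit: float = 0.01):
        """Map parameter units to a new center position and size for one component."""
        x, y, s = clamp(self.x), clamp(self.y), clamp(self.scale)
        return (cx + x * px_per_unit,
                cy + y * px_per_unit,
                size * (1.0 + s * scale_per_unit))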

Accurate Facial Parameter Prediction

To enable precise parameter estimation, we synthesize a large dataset of 100,000 paired samples by rendering 1024×1024 facial images at consistent positions using the PyGame rendering engine. For each rendered image, facial landmarks are extracted and matched with their corresponding Live2D parameters. We then train a Multilayer Perceptron (MLP) to learn the mapping from facial landmarks to Live2D parameters. During inference, this network accurately predicts the necessary parameters based on the detected landmark positions from the input image.
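A minimal PyTorch sketch of such a landmark-to-parameter regressor is shown below. The landmark count, the number of output parameters (three per component), and the layer widths are assumptions; the report only specifies that an MLP is trained on the 100,000 synthetic landmark/parameter pairs.

import torch
import torch.nn as nn

class LandmarkToLive2D(nn.Module):
    def __init__(self, num_landmarks: int = 106, num_params: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_landmarks * 2, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_params),   # e.g. x / y / scale per facial component
        )

    def forward(self, landmarks: torch.Tensor) -> torch.Tensor:
        # landmarks: (B, num_landmarks, 2), normalized to the 1024x1024 render size
        return self.net(landmarks.flatten(1))

# Training sketch: regress the Live2D parameters with an L2 loss on synthetic pairs.
model = LandmarkToLive2D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()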

Dynamic Artifact Correction

Once the facial parameters are obtained, the corresponding textures are placed accordingly. However, during animation, visual artifacts may occur due to misalignment between the foreground elements and the underlying face image; for example, when the eyes are closed, the background eyes may still be visible. To resolve this issue, we render facial masks based on the inferred parameters and use them to precisely identify the regions requiring inpainting. Guided by these masks, we repaint the underlying face image to eliminate visual inconsistencies, ensuring that the Live2D model remains artifact-free during animation.
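The following sketch illustrates this repainting step: a mask rendered from the inferred parameters marks where foreground features (eyes, mouth, etc.) sit on the underlying face, and those pixels are filled in. Classical OpenCV inpainting is used here only as a stand-in; the report does not specify the exact inpainting method.

import cv2
import numpy as np

def repaint_underlying_face(face_bgr: np.ndarray, feature_mask: np.ndarray) -> np.ndarray:
    """face_bgr: HxWx3 underlying face texture.
    feature_mask: HxW array, nonzero where a foreground facial feature
    would otherwise show through during animation."""
    mask = (feature_mask > 0).astype(np.uint8) * 255
    # Dilate slightly so feature borders are fully covered before inpainting.
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=2)
    return cv2.inpaint(face_bgr, mask, 5, cv2.INPAINT_TELEA)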

Hair Transfer

After aligning the facial contour, we perform hair segmentation on the input image to extract the hair mask and transfer the masked hair region into the hair texture. If bangs occlude the eyebrows in the input image, we first remove the hair before extracting the facial feature textures and parameters. Finally, we segment the hair from the original image and transfer it to the final Live2D model as a separate texture layer.
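A minimal sketch of the transfer itself is given below: given a hair mask from any off-the-shelf segmenter (the segmentation model is not specified in the report), the hair region is cut out as an RGBA layer that can be placed in the Live2D model as its own texture.

import cv2
import numpy as np

def extract_hair_layer(image_bgr: np.ndarray, hair_mask: np.ndarray) -> np.ndarray:
    """Return an HxWx4 BGRA texture containing only the segmented hair region."""
    alpha = (hair_mask > 0).astype(np.uint8) * 255
    # Feather the mask edge slightly for a cleaner composite over the face layers.
    alpha = cv2.GaussianBlur(alpha, (7, 7), 0)
    bgra = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = alpha
    return bgra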


Created Characters


BibTeX


@misc{he2025cartoonaliveexpressivelive2dmodeling,
      title={CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits}, 
      author={Chao He and Jianqiang Ren and Jianjing Xiang and Xiejie Shen},
      year={2025},
      eprint={2507.17327},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.17327}, 
}