The 2D cartoon style is a prominent art form in digital character creation, particularly popular among younger audiences. While advancements in digital human technology have spurred extensive research into photorealistic digital humans and 3D characters, interactive 2D cartoon characters have received comparatively less attention. Unlike their 3D counterparts, which require sophisticated construction and resource-intensive rendering, Live2D, a widely used format for 2D cartoon characters, offers a more efficient alternative: it animates 2D characters in a manner that simulates 3D movement without requiring a complete 3D model. Furthermore, Live2D employs lightweight HTML5 (H5) rendering, improving both accessibility and efficiency. In this technical report, we introduce Textoon, an innovative method for generating diverse 2D cartoon characters in the Live2D format from text descriptions. Textoon leverages cutting-edge language and vision models to interpret the user's textual intent and generate the 2D appearance, and it can create a wide variety of stunning, interactive 2D characters within one minute.
 
Overview of Textoon. The framework uses fine-tuned LLMs to accurately extract component descriptions from the user's input text, then uses the corresponding components to control the appearance generation of the 2D cartoon character. Users can re-edit details, and the components are then used to extract and complete the generated image into Live2D model textures. The resulting Live2D models are diverse and remain compatible with the original animations.
Our text parsing model excels at extracting detailed information from complex user descriptions. It accurately identifies features such as back hair, side hair, bangs, eye color, eyebrows, face shape, clothing type, and shoe type. This advanced text parsing capability allows for more flexible user inputs.
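As an illustration of what the parsed result might look like downstream, the sketch below holds the features listed above in a typed record. It assumes the parser returns a JSON object with one key per component. That output format, the `CharacterComponents` class, and the `parse_components` helper are hypothetical and are not taken from the Textoon implementation.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CharacterComponents:
    """Hypothetical container for the component descriptions named above."""
    back_hair: Optional[str] = None
    side_hair: Optional[str] = None
    bangs: Optional[str] = None
    eye_color: Optional[str] = None
    eyebrows: Optional[str] = None
    face_shape: Optional[str] = None
    clothing_type: Optional[str] = None
    shoe_type: Optional[str] = None


def parse_components(parser_output: str) -> CharacterComponents:
    """Convert a JSON string (assumed parser output) into a typed record.

    Keys that do not match a known component are ignored; components that
    the description does not mention stay None.
    """
    raw = json.loads(parser_output)
    known = {f.name for f in fields(CharacterComponents)}
    return CharacterComponents(**{k: v for k, v in raw.items() if k in known})


# Example: only some components are mentioned, and "mood" is not a component.
example = '{"back_hair": "long", "bangs": "straight", "eye_color": "blue", "mood": "happy"}'
components = parse_components(example)
```

Leaving unmentioned components as `None` lets a later stage decide whether to fall back to a default part or to ask the user for more detail.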
After parsing the text, each component is synthesized into a comprehensive character template. The contour boundaries offer precise control over the shape of the generated character, while a text-to-image model takes charge of generating the inner color and texture.
If users are not satisfied with the initial result and wish to modify specific details, our framework helps them select specific positions at which to add, remove, or modify elements.
The control coefficients for the Live2D model's mouth primarily include MouthOpenY and MouthForm. MouthOpenY controls the vertical movement of the mouth, while MouthForm adjusts the expressions, such as upturning and grimacing. However, these controls often result in suboptimal driving performance. To enhance the accuracy of speech animations for cartoon characters, we integrate ARKit's face blend shape capabilities into the Live2D lip-sync functionality. This integration significantly improves the realism and precision of the animated speech.
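To make the limitation of the two-parameter control concrete, the sketch below reduces a set of ARKit blend shape coefficients to MouthOpenY and MouthForm only. It is not Textoon's integration, whose details are not given here. It is a hypothetical baseline in which the function name, the averaging rule, and the value ranges (0 to 1 for MouthOpenY, -1 to 1 for MouthForm) are assumptions. The blend shape names (`jawOpen`, `mouthSmileLeft`, `mouthSmileRight`, `mouthFrownLeft`, `mouthFrownRight`, `mouthPucker`) are standard ARKit identifiers.

```python
def clamp(value: float, lo: float, hi: float) -> float:
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def two_parameter_mouth(blend_shapes: dict) -> dict:
    """Collapse ARKit mouth blend shapes into two Live2D-style parameters.

    Assumed mapping (illustrative only):
      MouthOpenY <- jawOpen, clamped to [0, 1]
      MouthForm  <- mean smile minus mean frown, clamped to [-1, 1]
    Every other blend shape (for example mouthPucker) is discarded.
    """
    jaw_open = blend_shapes.get("jawOpen", 0.0)
    smile = 0.5 * (blend_shapes.get("mouthSmileLeft", 0.0)
                   + blend_shapes.get("mouthSmileRight", 0.0))
    frown = 0.5 * (blend_shapes.get("mouthFrownLeft", 0.0)
                   + blend_shapes.get("mouthFrownRight", 0.0))
    return {
        "MouthOpenY": clamp(jaw_open, 0.0, 1.0),
        "MouthForm": clamp(smile - frown, -1.0, 1.0),
    }


# Two visibly different mouth shapes: a relaxed open mouth and a puckered "oo".
relaxed = {"jawOpen": 0.3}
puckered = {"jawOpen": 0.3, "mouthPucker": 0.9}

# Both collapse to the same pair of values, so the pucker is lost.
same_pose = two_parameter_mouth(relaxed) == two_parameter_mouth(puckered)
```

Because distinct mouth shapes map to identical parameter values under this reduction, two parameters alone cannot reproduce them, which is consistent with the motivation for bringing the richer ARKit blend shapes into the lip-sync pipeline.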
Textoon supports both English and Chinese prompts.
@article{he2025textoon,
  title={Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions},
  author={Chao He and Jianqiang Ren and Yuan Dong and Jianjing Xiang and Xiejie Shen and Weihao Yuan and Liefeng Bo},
  journal={arXiv preprint arXiv:2501.10020},
  year={2025}
}