FastAvatar: Towards Unified Fast High-Fidelity 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

1Tongji University, 2Shanghai Innovation Institute, 3Shanghai Jiao Tong University, 4Akool Research
*Corresponding authors
Teaser Image

FastAvatar is a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model.

Abstract

Despite significant progress, 3D avatar reconstruction still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. FastAvatar’s core is a Large Gaussian Reconstruction Transformer featuring three key designs: first, a VGGT-style transformer variant that aggregates multi-frame cues while injecting an initial 3D prompt to predict an aggregatable canonical 3DGS representation; second, multi-granular guidance encoding (camera pose, FLAME expression, head pose) that mitigates animation-induced misalignment for variable-length inputs; third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., quality improves as more observations arrive, unlike prior work that underutilizes additional input data. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar achieves higher quality than existing methods at highly competitive speed.

Method

Method Image

FastAvatar is a feed-forward framework designed for high-quality, animatable 3D Gaussian Splatting (3DGS) avatar reconstruction from variable-length inputs such as a single image or multi-view video sequences. Unlike optimization-based methods, FastAvatar leverages a Large Gaussian Reconstruction Transformer (LGRT) architecture to achieve robust reconstruction fidelity efficiently while supporting dynamic facial expressions and poses.

The pipeline can be summarized as follows:

  • Facial Tokenization: Each input frame is processed into latent tokens via DINOv2 for feature extraction. Tokens are encoded with expression coefficients, pose, and camera information to distinguish facial features across frames. Patchified tokens ensure compatibility with subsequent transformer-based processing.
  • Attention-Based Aggregation: Tokens are fused using both frame attention (intra-frame processing via positional prompts) and global attention (cross-frame alignment). This enables accurate 3D spatial registration and consistency across variable-length inputs.
  • Gaussian Splatting Prediction: Aggregated tokens are processed by an MLP-based GS Head to predict Gaussian attributes such as position, scale, rotation, color, and opacity. These attributes define a high-quality 3D Gaussian point cloud for the animatable avatar (the tokenization, aggregation, and prediction steps are illustrated in the first sketch after this list).
  • Canonical Fusion: Multi-view Gaussian splatting predictions are fused into a single canonical 3DGS model, guided by the Landmark Tracking Loss and the Sliced Fusion Loss to ensure proper alignment and eliminate artifacts (see the second sketch below).
  • Efficient Rendering: The reconstructed 3DGS avatar is driven by expression codes (e.g., derived from FLAME) for facial animation control, achieving real-time rendering at ~55 FPS with 16 input frames on an RTX 4090 GPU (see the third sketch below).
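
The following is a minimal PyTorch sketch of the pipeline described above (facial tokenization, frame/global attention, GS Head). It is illustrative only: module names, token dimensions, guidance-vector sizes, and the query-pooling step are assumptions not taken from the paper, and the DINOv2 backbone is replaced by a simple patch embedding so the sketch stays self-contained.

```python
# Minimal sketch of an LGRT-style forward pass. All sizes and module names are
# illustrative assumptions; this is not the authors' implementation.
import torch
import torch.nn as nn

class LGRTSketch(nn.Module):
    def __init__(self, dim=384, n_heads=6, n_blocks=2, n_gaussians=1024):
        super().__init__()
        # Stand-in for frozen DINOv2 patch features: 16x16 patches -> dim-d tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Multi-granular guidance: camera pose (7), FLAME expression (100), head pose (6) -- assumed sizes.
        self.cond_proj = nn.Linear(7 + 100 + 6, dim)
        self.frame_attn = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True) for _ in range(n_blocks)])
        self.global_attn = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True) for _ in range(n_blocks)])
        # GS Head: each learnable query regresses one Gaussian's attributes.
        self.queries = nn.Parameter(torch.randn(n_gaussians, dim) * 0.02)
        self.gs_head = nn.Linear(dim, 3 + 3 + 4 + 3 + 1)  # xyz, scale, rotation (quat), rgb, opacity

    def forward(self, frames, cond):
        # frames: (B, T, 3, H, W) input views; cond: (B, T, 113) per-frame guidance vectors.
        B, T = frames.shape[:2]
        tok = self.patch_embed(frames.flatten(0, 1)).flatten(2).transpose(1, 2)  # (B*T, P, D)
        tok = tok + self.cond_proj(cond).flatten(0, 1).unsqueeze(1)              # inject guidance
        for blk in self.frame_attn:              # frame attention: tokens within each frame
            tok = blk(tok)
        tok = tok.reshape(B, -1, tok.shape[-1])  # concatenate all frames' tokens
        for blk in self.global_attn:             # global attention: tokens across frames
            tok = blk(tok)
        # Pool aggregated tokens into learnable Gaussian queries (a simplification of cross-attention).
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + tok.mean(dim=1, keepdim=True)
        return self.gs_head(q)                   # (B, n_gaussians, 14) raw Gaussian attributes

model = LGRTSketch()
frames = torch.randn(1, 4, 3, 224, 224)  # 4 input frames
cond = torch.randn(1, 4, 113)
gaussians = model(frames, cond)          # (1, 1024, 14)
```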
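
The page names the two fusion objectives but does not give their formulas, so the second sketch is an assumption-laden illustration: a landmark term that pulls designated Gaussian centers onto tracked 3D landmarks, and a sliced term that compares per-frame and canonical Gaussian centers along random 1D projections (a sliced, Wasserstein-style distance). The landmark-to-Gaussian binding and the projection scheme are hypothetical.

```python
# Hedged sketches of the fusion losses; the actual formulations in the paper may differ.
import torch

def landmark_tracking_loss(gauss_xyz, landmark_idx, landmarks_3d):
    # gauss_xyz: (N, 3) predicted Gaussian centers in canonical space.
    # landmark_idx: (K,) indices of Gaussians bound to facial landmarks (assumed binding).
    # landmarks_3d: (K, 3) tracked landmark positions lifted to canonical space.
    return (gauss_xyz[landmark_idx] - landmarks_3d).pow(2).sum(-1).mean()

def sliced_fusion_loss(frame_xyz, canon_xyz, n_proj=64):
    # frame_xyz: (M, 3) centers predicted from one frame; canon_xyz: (N, 3) canonical centers.
    dirs = torch.nn.functional.normalize(
        torch.randn(n_proj, 3, device=frame_xyz.device), dim=-1)
    p_frame, _ = torch.sort(frame_xyz @ dirs.T, dim=0)  # (M, n_proj) sorted 1D projections
    p_canon, _ = torch.sort(canon_xyz @ dirs.T, dim=0)  # (N, n_proj)
    # Resample both sorted marginals to a common length so they compare pointwise.
    m = min(p_frame.shape[0], p_canon.shape[0])
    idx_f = torch.linspace(0, p_frame.shape[0] - 1, m, device=frame_xyz.device).long()
    idx_c = torch.linspace(0, p_canon.shape[0] - 1, m, device=frame_xyz.device).long()
    return (p_frame[idx_f] - p_canon[idx_c]).pow(2).mean()
```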
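
Finally, a hedged sketch of the expression-driven rendering loop. `deform_gaussians` and `rasterize_3dgs` are hypothetical callables standing in for the paper's FLAME-driven deformation module and a 3DGS rasterizer; they are not real library APIs.

```python
# Illustrative animation loop: deform the canonical 3DGS with per-frame FLAME codes, then rasterize.
import torch

@torch.no_grad()
def render_sequence(canon_gs, expr_codes, head_poses, cameras,
                    deform_gaussians, rasterize_3dgs):
    # canon_gs: canonical Gaussian attributes (xyz, scale, rotation, rgb, opacity).
    # expr_codes: (T, E) FLAME expression coefficients; head_poses: (T, 6); cameras: T camera views.
    # deform_gaussians / rasterize_3dgs: hypothetical driving module and 3DGS rasterizer.
    frames = []
    for t in range(expr_codes.shape[0]):
        driven = deform_gaussians(canon_gs, expr_codes[t], head_poses[t])  # expression-driven 3DGS
        frames.append(rasterize_3dgs(driven, cameras[t]))                  # (H, W, 3) rendered image
    return torch.stack(frames)
```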

Self-reenacted Results

Cross-reenacted Results

Multi-view & Incremental Reconstruction Results

BibTeX

@misc{wu2025fastavatarunifiedfasthighfidelity,
  title={FastAvatar: Towards Unified Fast High-Fidelity 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers},
  author={Yue Wu and Yufan Wu and Wen Li and Yuxi Lu and Kairui Feng and Xuanhong Chen},
  year={2025},
  eprint={2508.19754},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.19754},
}