FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers

ICLR 2026

¹Tongji University, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University, ⁴AKool
*Corresponding author

Core Highlights

Unified model with flexible multi-frame aggregation for ultra-high-fidelity avatars.
Fast feedforward reconstruction that delivers high-quality 3DGS models within seconds.
Incremental reconstruction from diverse inputs, including single-shot, monocular, and multi-view observations.

Abstract

Despite significant progress, 3D avatar reconstruction still faces challenges such as high time complexity, sensitivity to data quality, and low data utilization. We propose FastAvatar, a feedforward 3D avatar framework capable of flexibly leveraging diverse daily recordings (e.g., a single image, multi-view observations, or monocular video) to reconstruct a high-quality 3D Gaussian Splatting (3DGS) model within seconds, using only a single unified model. The core of FastAvatar is a Large Gaussian Reconstruction Transformer (LGRT) featuring three key designs: first, a 3DGS transformer that aggregates multi-frame cues while injecting an initial 3D prompt to predict the corresponding registered canonical 3DGS representations; second, multi-granular guidance encoding (camera pose, expression coefficient, head pose) that mitigates animation-induced misalignment for variable-length inputs; third, incremental Gaussian aggregation via landmark tracking and sliced fusion losses. Integrating these features, FastAvatar enables incremental reconstruction, i.e., quality improves with more observations, without wasting input data as in previous works. This yields a quality-speed-tunable paradigm for highly usable avatar modeling. Extensive experiments show that FastAvatar achieves higher quality and highly competitive speed compared with existing methods. Code and models are available at the project repository.

Method

FastAvatar is a feedforward framework designed to reconstruct a high-quality, animatable 3D Gaussian Splatting (3DGS) avatar from an unordered, variable-length set of observations such as a single selfie, monocular video frames, or multi-view captures. The model consumes RGB observations together with camera parameters, expression coefficients, and head pose, then outputs a canonical 3DGS avatar that can be animated from arbitrary viewpoints.
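
As a rough illustration of the input described above, the following Python sketch models a single observation and a variable-length set of them. The `Observation` container, its field names, and the `check_observation_set` and `make_dummy` helpers are hypothetical and are not taken from the FastAvatar code; the parameter sizes are placeholders, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Observation:
    """One input frame and its guidance signals (hypothetical container)."""
    rgb: Sequence[Sequence[Sequence[float]]]  # H x W x 3 pixel values
    camera: Sequence[float]      # camera parameters, flat layout assumed
    expression: Sequence[float]  # expression coefficients
    head_pose: Sequence[float]   # head-pose parameters


def check_observation_set(observations: List[Observation]) -> int:
    """Validate a variable-length, unordered set of observations.

    Returns the number of frames. Only consistency across frames is checked,
    because the actual dimensions are an assumption of this sketch.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    first = observations[0]
    for obs in observations:
        for name in ("camera", "expression", "head_pose"):
            if len(getattr(obs, name)) != len(getattr(first, name)):
                raise ValueError(f"inconsistent length for '{name}'")
    return len(observations)


def make_dummy(seed: float) -> Observation:
    """Build a tiny 2x2 placeholder frame for demonstration purposes."""
    pixel = [seed, seed, seed]
    return Observation(
        rgb=[[pixel, pixel], [pixel, pixel]],
        camera=[0.0] * 6,       # assumed size, for illustration only
        expression=[0.0] * 10,  # assumed size, for illustration only
        head_pose=[0.0] * 3,    # assumed size, for illustration only
    )


# The same check accepts a single selfie or several video frames.
single_selfie = [make_dummy(0.1)]
video_frames = [make_dummy(0.1 * i) for i in range(1, 6)]
print(check_observation_set(single_selfie), check_observation_set(video_frames))
```

The point of the sketch is only that one interface covers the single-shot, monocular, and multi-view cases: the number of frames varies, while each frame carries the same kinds of guidance signals.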

The pipeline is centered on the Large Gaussian Reconstruction Transformer (LGRT):

Face Encoding: Each input frame is encoded into facial tokens with DINOv2, then augmented with camera pose, expression coefficient, and head-pose encodings so the model can distinguish appearance changes across frames.
3DGS Transformer Aggregation: Frame attention injects coarse 3D positional prompts and aggregates intra-frame features, while global attention registers and fuses information across variable-length observations.
3DGS Attribute Generation: A shared MLP-based GS Head predicts Gaussian attributes including color, opacity, scale, rotation, importance score, and point offset for each registered frame representation.
Canonical 3DGS Fusion: Frame-wise Gaussian representations are fused into a canonical avatar model, combining complementary details from different views, expressions, and poses.
Incremental Reconstruction: Landmark tracking loss and sliced fusion loss supervise accurate registration and fusion, enabling reconstruction quality to improve as more observations become available.
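
The data flow through the stages listed above can be sketched at the level of tensor shapes. The NumPy code below is a simplified illustration under stated assumptions, not the authors' implementation: attention has no learned projections, the guidance encoding and the 3D prompt are simply added to the tokens, the GS head is a single linear layer, the 15-channel attribute layout is invented for the example, and fusion is reduced to pooling all per-frame Gaussians and keeping those with the highest importance score. The training-time landmark tracking and sliced fusion losses are not modeled. All function and variable names are hypothetical.

```python
import numpy as np

# Assumed attribute layout for one Gaussian (15 channels); the real one may differ.
ATTR = {"color": slice(0, 3), "opacity": slice(3, 4), "scale": slice(4, 7),
        "rotation": slice(7, 11), "importance": slice(11, 12), "offset": slice(12, 15)}
NUM_ATTR = 15


def self_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention without learned projections (simplification).

    x has shape (batch, tokens, dim); attention runs over the tokens axis.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def lgrt_sketch(tokens, guidance, prompt, head_weights, num_blocks=2):
    """Map (F, N, D) facial tokens to (F, N, NUM_ATTR) Gaussian attributes."""
    num_frames, num_tokens, dim = tokens.shape
    # Face encoding: add each frame's guidance vector to all of its tokens.
    x = tokens + guidance[:, None, :]
    for _ in range(num_blocks):
        # Frame attention: tokens attend within their own frame, with the
        # coarse 3D prompt injected (here simply added, an assumption).
        x = x + self_attention(x + prompt[None, :, :])
        # Global attention: all tokens of all frames attend to each other.
        flat = x.reshape(1, num_frames * num_tokens, dim)
        x = x + self_attention(flat).reshape(num_frames, num_tokens, dim)
    # Shared GS head: a single linear layer stands in for the MLP.
    return x @ head_weights


def fuse(frame_gaussians: np.ndarray, keep: int) -> np.ndarray:
    """Pool per-frame Gaussians and keep the `keep` most important ones."""
    pooled = frame_gaussians.reshape(-1, frame_gaussians.shape[-1])
    importance = pooled[:, ATTR["importance"]].ravel()
    order = np.argsort(-importance)[:keep]
    return pooled[order]


rng = np.random.default_rng(0)
N, D = 8, 16  # tokens per frame and feature width, chosen arbitrarily
prompt = rng.normal(size=(N, D))
head_weights = rng.normal(size=(D, NUM_ATTR))
for F in (1, 4):  # the same weights handle one frame or several
    tokens = rng.normal(size=(F, N, D))
    guidance = rng.normal(size=(F, D))
    gaussians = lgrt_sketch(tokens, guidance, prompt, head_weights)
    canonical = fuse(gaussians, keep=N)
    print(F, gaussians.shape, canonical.shape)
```

Because the global attention step in this sketch carries no frame-order encoding, permuting the input frames only permutes the per-frame outputs, which is one simple way to accommodate an unordered, variable-length input set. Whether the actual model achieves this in the same way is not stated in the text.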

BibTeX

@inproceedings{wu2026fastavatar,
  title={FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers},
  author={Yue Wu and Xuanhong Chen and Yufan Wu and Wen Li and Yuxi Lu and Kairui Feng},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=P7zBSCs4Xt}
}
