Chorus

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

CVPR 2026 Oral

Yue Li‡,*,1, Qi Ma*,2,3, Runyi Yang3, Mengjiao Ma3, Bin Ren4, Nikola Popovic3, Nicu Sebe4,
Theo Gevers1, Luc Van Gool3, Danda Pani Paudel†,3, Martin R. Oswald†,1

Project lead. * Equal contribution. Equal supervision.

1University of Amsterdam, 2ETH Zürich, 3INSAIT, Sofia University “St. Kliment Ohridski”, 4University of Trento

Motivation

From 3DGS Semantic Feature to Holistic Scene Encoder

SceneSplat showed that lifted 2D language features can supervise a feed-forward 3DGS scene encoder for open-vocabulary segmentation. Chorus starts from the same 3DGS substrate, but asks for a broader representation: a compact scene embedding that can support semantics, instances, querying, and VLM reasoning.

Current 3DGS scene-encoder limitations

Open-vocabulary semantic focus

Existing 3DGS scene encoder focuses on language-aligned, open-vocabulary semantic learning. Objectness, correspondence, and fine spatial structure are less directly supervised.

Limited embedding capability

A reusable scene embedding should go beyond open-vocabulary semantics: it should encode geometry, appearance, instance structure, and supports multiple downstream tasks.

Token-heavy final features

SceneSplat encoder can retain close to 10K pooled multi-level tokens. LLM/VLM interfaces benefit from a smaller final-stage scene representation.

Chorus addresses these limits by pretraining general-purpose 3D scene encoders from Gaussian splats via multi-teacher distillation of 2D foundation models. A shared 3DGS backbone captures common scene factors, while lightweight teacher-specific projectors absorb language-aligned, generalist, and object-aware embedding differences.
Multi-teacher supervision Compact scene tokens One encoder for diverse tasks
Chorus teaser figure showing multi-teacher pretraining and evaluation

Demo

Chorus Inference Feature PCA

At inference, Chorus processes an entire reconstructed scene in one pass. The viewer pairs RGB rendering with learned feature PCA, making the scene embedding visible. The toggle switches the PCA view between Gaussian splats and their point-center proxy.

InteriorGS Scenes

Interactive 3DGS viewer

Ready

Rendering
Chorus PCA

HM3D Scenes

Interactive 3DGS viewer

Ready

Rendering
Chorus PCA

Marble Generated Scenes

Interactive 3DGS viewer

Ready

Rendering
Chorus PCA

Method

Chorus Pipeline

Chorus distills language-aligned, generalist, and object-aware 2D teacher signals into a shared scene embedding using lightweight per-teacher projectors. The pretrained embedding transfers to downstream modules for semantic and instance segmentation, open-vocabulary queries, and 3D scene Q&A. For new 3DGS data, render-and-distill adaptation finetunes the trained encoder from rendered views with online teacher supervision.

Chorus pretraining pipeline

3DGS Results

Zero-Shot 3D Semantic Segmentation

The core 3DGS evaluation is zero-shot fine-grained segmentation: the encoder produces language-aligned scene features without task-specific training on the target labels. Chorus improves over SceneSplat across these 3DGS semantic benchmarks while remaining compact and feed-forward.

Zero-shot 3D semantic segmentation comparison for Mosaic3D, SceneSplat, and Chorus
Zero-shot fine-grained segmentation: Chorus improves foreground-mIoU over SceneSplat on ScanNet200, Matterport3D, ScanNet++, and InteriorGS.
+2.1 foreground-mIoU on ScanNet200 vs. SceneSplat
+5.7 foreground-mIoU on InteriorGS for novel-scene generalization
8.32× fewer training scenes than Mosaic3D

Presentation

One More Thing

The Same Recipe Trains a Strong Point-Cloud Encoder

This is the surprising part: Chorus does not need a new pretraining framework nor new data for building point clouds encoder. We keep the same multi-teacher distillation pipeline and change only the encoder input channels.

same Chorus framework multi-teacher distillation
SigLIP2 DINOv3 PE-Spatial

Before

3DGS input

coord color scale opacity quat estimated normals

Full 3DGS Chorus consumes position, color, scale, opacity, and orientation. The point-cloud variant consumes only Gaussian centers, colors, and estimated normals; all teachers, losses, and distillation stages stay unchanged. At inference, the trained variant operates as a point-cloud encoder.

Segmentation probing and finetuning comparison between Sonata and Chorus
Point-cloud variant performance: semantic segmentation probing, instance segmentation probing, and training-scene and -label efficiency evaluation.

Why does Chorus framework transfer to point clouds?

We link three hypotheses with explanations: 3DGS pretraining acts like structured augmentation, multi-teacher supervision scales more efficiently than the self-supervised learning route, and heterogeneous teachers contribute complementary but alignable signals.

(i) Noise-robust features.

Clean-to-perturbed retrieval and iPhone RGB-D probing show that the point-cloud variant remains stable under input noise and data capture changes.

(ii) Efficient scaling.

Scene-level self-supervised trained encoder on 3DGS is less competitive in our ablations; distilling foundation-model features scales faster as training scenes increase.

(iii) Complementary teachers.

SigLIP2/PE-Spatial supply language-aligned and object-aware cues, DINOv3 supplies generalist structure, and per-teacher projectors keep idiosyncrasies from competing in one output embedding space.

Evidence for (i): robustness under perturbation and capture shift

Chorus retrieves perturbed instances more reliably than Sonata, and probing performance remains strong when ScanNet++ input changes from laser scans to iPhone RGB-D point clouds.

Feature-space check: compact and separated clusters

UMAP on per-point features from 15 semantic classes gives a direct view of the learned embedding. Chorus forms tighter within-instance clusters and clearer inter-class margins than Sonata.

Sonata feature UMAP with scattered and overlapping clusters
Sonata SSL point-cloud encoder
Chorus feature UMAP with compact and separated clusters
Chorus point-cloud variant

Evidence for (ii) and (iii): scaling and teacher contribution

The scaling trend shows the performance and efficiency gains of multi-teacher pretraining, while the teacher ablation shows that DINOv3 and PE-Spatial add consistent gains and generalization beyond language supervision alone.

Scaling trend comparing SSL pretraining and Chorus pretraining
Multi-teacher pretraining scales faster and reaches higher linear-probing mIoU than self-supervised pretraining.
Teacher ablation table showing added teachers improve performance
Added teachers drive gains on ScanNet++ and InteriorGS generalization.

Applications

3D Scene Question Answering

Chorus can be dropped into GaussianVLM as the 3DGS scene encoder. Using only final-stage Chorus features in place of SceneSplat improves key ScanQA/Nr3D metrics while providing a lighter token interface and reducing training time to about 0.68× that of the SceneSplat-based GaussianVLM.

3D scene question answering quantitative comparison
Quantitative Q&A: Chorus-based GaussianVLM improves on ScanQA/Nr3D benchmarks.
GaussianVLM qualitative examples comparing Chorus and SceneSplat
Qualitative Q&A examples: Chorus-based GaussianVLM answers the grounded object queries correctly.

Online Demo with Chorus Encoder

Chorus encoder on point-cloud and 3DGS scenes.

Point-cloud scene

Point-cloud scene

3DGS scene

3DGS scene

Citation

@InProceedings{Li_2026_CVPR,
    author    = {Li, Yue and Ma, Qi and Yang, Runyi and Ma, Mengjiao and Ren, Bin and Popovic, Nikola and Sebe, Nicu and Gevers, Theo and Van Gool, Luc and Paudel, Danda Pani and Oswald, Martin R.},
    title     = {Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {21431-21442}
}
Citation copied to clipboard