AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting

Korea Advanced Institute of Science and Technology (KAIST)

AirSplat is a novel training framework designed to significantly enhance the high-fidelity novel view synthesis (NVS) capabilities of 3D Vision Foundation Models such as DepthAnything3.

Interactive comparison: DepthAnything3 vs. AirSplat (Ours).

Abstract

While 3D Vision Foundation Models (3DVFMs) have demonstrated remarkable zero-shot capabilities in visual geometry estimation, their direct application to generalizable novel view synthesis (NVS) remains challenging.

In this paper, we propose AirSplat, a novel training framework that effectively adapts the robust geometric priors of 3DVFMs for high-fidelity, pose-free NVS. Our approach introduces two key technical contributions: (1) Self-Consistent Pose Alignment (SCPA), a training-time feedback loop that ensures pixel-aligned supervision, resolving the pose-geometry discrepancy; and (2) Rating-based Opacity Matching (ROM), which leverages the local 3D-geometry consistency knowledge of a sparse-view NVS teacher model to filter out degraded primitives.

Experimental results on large-scale benchmarks demonstrate that our method significantly outperforms state-of-the-art pose-free NVS approaches in reconstruction quality. AirSplat highlights the potential of adapting 3DVFMs for simultaneous visual geometry estimation and high-quality view synthesis.

Proposed Framework


  • Self-Consistent Pose Alignment (SCPA): To resolve pose-geometry discrepancies, SCPA dynamically anchors predicted target camera poses to the scene geometry via a training-time feedback loop, resulting in stable optimization with photometric loss.
  • Rating-based Opacity Matching (ROM): ROM measures the geometric consistency of predicted primitives using a feed-forward 3DGS teacher model and converts the resulting geometric error into a feedback rating. By directly regularizing each primitive's opacity against this rating, it naturally filters out "floaters" and other spatially inconsistent artifacts.
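At a high level, both components plug into an ordinary photometric training objective. A minimal sketch, assuming an MSE photometric term on the SCPA-aligned render and a simple squared opacity-matching regularizer (the weight `lam` and both loss forms are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def training_loss(render, target, opacities, ratings, lam=0.05):
    """Hypothetical combined objective: photometric loss on the
    pose-aligned render plus a ROM-style opacity-matching term."""
    photometric = np.mean((render - target) ** 2)
    rom = np.mean((opacities - ratings) ** 2)  # pull opacity toward its rating
    return photometric + lam * rom
```

With a perfect render and opacities that already match their ratings, the loss is zero; in practice the `lam` trade-off would be tuned against reconstruction quality.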

Figure 1. Overview of the AirSplat training pipeline.

Quantitative Results

We evaluate AirSplat against state-of-the-art methods on large-scale benchmarks. Our method achieves superior performance in terms of reconstruction quality.

RE10k Results

Table 1. Quantitative comparison on the RealEstate10K dataset.

DL3DV Results

Table 2. Quantitative comparison on the DL3DV dataset.

ACID Results

Table 3. Quantitative comparison on the ACID dataset for generalization.

Pose-Geometry Discrepancy

In feed-forward Novel View Synthesis (NVS), existing training paradigms face significant bottlenecks.

  • (a) Context-only training: provides no direct supervision for novel viewpoints, so the model fails to generalize well.
  • (b) Context-target training: due to asymmetric information flow, the predicted target poses often misalign with the generated 3D primitives, resulting in spatial misalignment (the Pose-Geometry Discrepancy) that causes severe blurring.

To resolve this, (c) our SCPA identifies this systematic drift and applies an inverse correction. Below, we interactively visualize this systematic drift: if we recursively apply the network's uncorrected pose predictions, the spatial error continuously drifts in a specific direction. By applying SCPA, we achieve an Aligned Render that matches the Ground Truth, enabling the network to learn both structurally consistent 3D geometry and robust novel view synthesis.
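The inverse correction can be sketched as follows, assuming camera poses are 4x4 SE(3) matrices and that the systematic drift has already been estimated as a rigid transform `T_drift` (the function names and the drift-estimation step itself are hypothetical; the paper does not specify the feedback loop at this level of detail):

```python
import numpy as np

def se3_inverse(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t,
    avoiding a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def align_pose(T_pred, T_drift):
    """Anchor the predicted target pose to the scene geometry by
    undoing the estimated systematic drift before rendering."""
    return se3_inverse(T_drift) @ T_pred
```

When the estimated drift is the identity, the pose passes through unchanged; otherwise the render is shifted back into pixel alignment with the ground-truth supervision before the photometric loss is applied.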

Rating-based Opacity Matching

We utilize a two-view feed-forward 3DGS model as the teacher: the input sequence is divided into pairs of adjacent views, and the geometric errors of the pixel-aligned primitives are computed separately for each pair and converted into ratings. These pairwise ratings are then aggregated across the sequence to map each primitive to its corresponding opacity penalty, rigorously enforcing local multi-view consistency.
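A minimal sketch of this rating-and-aggregation step, assuming a Gaussian error-to-rating mapping, min-aggregation across pairs, and a one-sided penalty on opacity that exceeds its aggregated rating (all three choices are illustrative assumptions, not the paper's stated design):

```python
import numpy as np

def pairwise_ratings(errors, sigma=0.1):
    """Map per-primitive geometric errors from one adjacent view pair
    to ratings in (0, 1]; low error yields a rating near 1."""
    return np.exp(-(errors / sigma) ** 2)

def opacity_penalty(opacities, ratings_per_pair):
    """Aggregate pairwise ratings (here: min across pairs, so the
    strictest pair wins) and penalize opacity above the rating."""
    rating = np.min(ratings_per_pair, axis=0)
    return np.mean(np.clip(opacities - rating, 0.0, None) ** 2)
```

A primitive that is geometrically consistent in every pair keeps a rating near 1 and pays no penalty; a primitive that is inconsistent in even one pair receives a low aggregated rating, so keeping it opaque becomes costly and "floaters" are suppressed.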