2026-04-05
F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling
Morui Zhu, Mohammad Dehghani Tezerjani, Mátyás Szántó
Problem
Centralized 3D Gaussian Splatting (3DGS) pipelines [Kerbl et al. 2023] assume all training images reside on a single machine. This fails in multi-agent robotic settings where agents independently collect observations and centralized data aggregation is impractical or prohibited due to three constraints:
- Communication cost — aggregating high-resolution images from $K$ agents scales linearly with fleet size, exceeding field-deployable bandwidth limits.
- Data privacy — multi-operator deployments prevent raw sensor data sharing.
- Scalability — joint optimization couples memory/compute to total fleet size.
Standard federated averaging (FedAvg [McMahan et al. 2017]) applied directly to 3DGS introduces two domain-specific challenges:
- Geometric drift — Gaussian primitives encode explicit 3D positions ($\mu_i \in \mathbb{R}^3$). Independent per-client optimization causes positions to drift into geometrically inconsistent configurations; averaging drifted positions yields incoherent reconstructions.
- Partial observability — each client observes only a contiguous temporal segment. Many Gaussians are visible to some clients but not others. Uniform averaging dilutes well-observed appearance estimates with uninformative updates from non-observing clients.
Prior work is limited: FedNeRF [Suzuki et al. 2024] federates NeRF MLPs but NeRF entangles geometry and appearance in shared parameters, precluding selective freezing. Fed3DGS [Suzuki 2024] applies distillation-based server updates to 3DGS but does not prevent geometric drift and incurs additional rendering cost per aggregation step. CoSurfGS [Gao et al. 2024] enables distributed Gaussian surface reconstruction but assumes cooperative model sharing rather than a federated data constraint. Collaborative SLAM methods (Swarm-SLAM [Lajoie et al. 2024], CCM-SLAM [Schmuck & Chli 2019], Kimera-Multi [Tian et al. 2022]) produce point clouds or occupancy grids, not photorealistic renderable representations.
Architecture
F3DGS decouples geometry from appearance across three stages:
1. Metric-Scale Pose Construction (Backend-Agnostic)
Each client runs a local monocular depth-based pose estimator (e.g., using Depth Anything V2 [Yang et al. 2024]) producing trajectory $\mathbf{T}^{(k)}$. A global LiDAR-based anchor trajectory (e.g., KISS-SLAM [Guadagnino et al. 2025]) provides metric scale. Client trajectories are aligned via Umeyama Sim(3) estimation:
\[s_k, R_k, t_k = \arg\min_{s, R, t} \sum_j \lVert p^{\text{anchor}}_j - (s R p^{\text{client}}_j + t) \rVert^2\]
Boundary smoothing prevents discontinuities at client boundaries by applying SE(3) corrections that decay over a window of $\tau$ frames:
\[\mathbf{T}^{\text{smooth}}_t = \text{Exp}(\beta(t) \cdot \text{Log}(\Delta T)) \cdot \mathbf{T}^{\text{aligned}}_t\]
where $\beta(t) = 1 - t/\tau$ decays linearly, so the correction $\Delta T$ shrinks to the identity by the end of the window.
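The alignment step above is the closed-form Umeyama solution. A minimal NumPy sketch, with function name and array conventions my own rather than the authors' code:

```python
import numpy as np

def umeyama_sim3(client_pts, anchor_pts):
    """Closed-form Sim(3) fit (Umeyama 1991): find s, R, t minimizing
    sum_j || anchor_j - (s R client_j + t) ||^2 over (N, 3) point arrays."""
    mu_c = client_pts.mean(axis=0)
    mu_a = anchor_pts.mean(axis=0)
    xc = client_pts - mu_c
    xa = anchor_pts - mu_a
    # Cross-covariance between the centered point sets
    cov = xa.T @ xc / len(client_pts)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against a reflection solution
    R = U @ S @ Vt
    var_c = (xc ** 2).sum() / len(client_pts)
    s = np.trace(np.diag(D) @ S) / var_c
    t = mu_a - s * R @ mu_c
    return s, R, t
```

Given noise-free correspondences, this recovers the ground-truth similarity exactly up to floating-point precision.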
2. Global Gaussian Initialization (LiDAR Geometric Scaffold)
LiDAR point clouds from all clients are fused into a common world frame. Each point initializes one Gaussian primitive:
\[\mathbf{g}_i = (\mu_i, s_i, q_i, \alpha_i, c_i)\]
where $\mu_i \in \mathbb{R}^3$ is the center, $s_i \in \mathbb{R}^3$ the log-scale (from the mean distance to the 3 nearest neighbors), $q_i \in \mathbb{R}^4$ a rotation quaternion, $\alpha_i \in \mathbb{R}$ a logit-opacity (initialized small positive), and $c_i$ the SH color coefficients (from the corresponding RGB). Total: $M \approx 6 \times 10^5$ Gaussians.
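A sketch of this initialization under stated assumptions: brute-force 3-NN (a KD-tree is needed at real scale), opacity initialized to 0.1, and the usual SH degree-0 constant $1/(2\sqrt{\pi})$ for the color DC term. The parameter layout is my own, not the paper's:

```python
import numpy as np

def init_gaussians(points, colors):
    """points: (M, 3) fused LiDAR cloud; colors: (M, 3) RGB in [0, 1].
    Returns per-Gaussian parameters as a dict (layout is an assumption)."""
    M = len(points)
    # Brute-force 3-NN for clarity only; use a KD-tree for M ~ 6e5
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    knn = np.sqrt(np.sort(d2, axis=1)[:, :3])            # (M, 3) distances
    mean_d = knn.mean(axis=1).clip(min=1e-7)
    log_scale = np.log(mean_d)[:, None].repeat(3, axis=1)  # isotropic start
    quat = np.tile([1.0, 0.0, 0.0, 0.0], (M, 1))         # identity rotation
    opacity_logit = np.full((M, 1), np.log(0.1 / 0.9))   # sigmoid -> 0.1
    sh_dc = (colors - 0.5) / 0.28209479177387814         # SH degree-0 from RGB
    return {"mu": points, "log_scale": log_scale, "quat": quat,
            "opacity": opacity_logit, "sh_dc": sh_dc}
```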
3. Visibility-Aware Federated Aggregation
Each client transmits updated appearance parameters $\theta^{(k)}_{\text{app}} = \{s_i, q_i, \alpha_i, c_i\}_{i=1}^M$ and per-Gaussian visibility counts $v_{k,i}$ (the number of rasterization passes in which Gaussian $i$ had a positive projected radius). The server computes normalized weights:
\[\alpha_{k,i} = \frac{v_{k,i}}{\sum_{j=1}^{K} v_{j,i} + \epsilon}, \quad \epsilon = 10^{-8}\]
The global attribute is the weighted combination:
\[a_i = \sum_{k=1}^{K} \alpha_{k,i} \, a_{k,i}\]
Gaussians with zero total visibility retain their previous global values. Quaternion signs are aligned before averaging and the result is re-normalized.
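The aggregation rule, including the zero-visibility fallback and quaternion sign alignment, can be sketched in NumPy (function and array names are mine):

```python
import numpy as np

def aggregate(attrs, vis, prev, eps=1e-8):
    """attrs: (K, M, D) per-client attribute updates; vis: (K, M) visibility
    counts; prev: (M, D) previous global values for unseen Gaussians."""
    total = vis.sum(axis=0)                       # (M,) total visibility
    w = vis / (total[None, :] + eps)              # alpha_{k,i}
    out = (w[:, :, None] * attrs).sum(axis=0)     # (M, D) weighted combination
    out[total == 0] = prev[total == 0]            # retain previous globals
    return out

def aggregate_quats(quats, vis, prev, eps=1e-8):
    """Align quaternion signs to client 0's hemisphere before averaging,
    then re-normalize the weighted mean."""
    ref = quats[0]                                # (M, 4) reference signs
    dots = (quats * ref[None]).sum(-1)            # (K, M)
    aligned = np.where(dots[:, :, None] < 0, -quats, quats)
    q = aggregate(aligned, vis, prev, eps)
    norm = np.linalg.norm(q, axis=-1, keepdims=True)
    return q / np.clip(norm, 1e-12, None)
```

Sign alignment matters because $q$ and $-q$ encode the same rotation; averaging them without alignment would cancel toward zero.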
Training
Problem Formulation
$N$ images with poses $\mathcal{D} = \{(I_t, \mathbf{T}_t)\}_{t=1}^N$ are partitioned into $K$ clients with chunk size $C=400$. Client $k$ holds:
\[\mathcal{D}_k = \{(I_t, \mathbf{T}_t) \mid t \in I_k\}, \quad I_k = [kC, \min((k+1)C-1, N-1)]\]
Objective: minimize $\sum_{k=1}^{K} \mathcal{L}_k(\mathbf{G})$.
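The chunking rule as a small sketch (helper name is mine):

```python
def partition_indices(N, C=400):
    """Split frame indices 0..N-1 into contiguous chunks of size C;
    the last client receives the remainder."""
    K = (N + C - 1) // C
    return [list(range(k * C, min((k + 1) * C, N))) for k in range(K)]
```

For seq 05 ($N = 543$), this yields $K = 2$ clients of 400 and 143 frames, consistent with the per-sequence $K$ values in the results table.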
Per-Client Rendering Loss
\[\mathcal{L}_k = \sum_{t \in I_k} \left[ (1-\lambda) \lVert I_t - \hat{I}_t \rVert_1 + \lambda (1 - \text{SSIM}(I_t, \hat{I}_t)) \right]\]
with $\lambda = 0.2$.
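A per-frame sketch of this loss, with the SSIM term left as a pluggable callable (a full SSIM implementation is beyond this note, so `ssim_fn` is an assumption):

```python
import numpy as np

def render_loss(I, I_hat, ssim_fn, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM); images as float arrays in [0, 1]."""
    l1 = np.abs(I - I_hat).mean()
    return (1.0 - lam) * l1 + lam * (1.0 - ssim_fn(I, I_hat))
```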
Frozen Geometry
\[\mu^{(k)}_i = \mu_i \quad \forall k, \forall \text{steps}\]
Only appearance parameters $\theta_{\text{app}} = \{s_i, q_i, \alpha_i, c_i\}_{i=1}^M$ receive gradient updates. Adaptive density control (ADC) is disabled — no split/clone/prune during federated training.
Hyperparameters
| Parameter | Value |
|---|---|
| Gaussians $M$ | $6 \times 10^5$ (fixed) |
| Chunk size $C$ | 400 frames |
| Communication rounds $R$ | 7 |
| Local steps per round $T$ | 1000 |
| Total optimization budget | 7000 steps |
| Loss weight $\lambda$ | 0.2 |
| Visibility smoothing $\epsilon$ | $10^{-8}$ |
| Validation sampling | Every 8th frame |
Communication Frequency Ablation
Under a fixed 7000-step budget, more frequent aggregation improves global quality, with diminishing returns:
- $R=1, T=7000$: Global PSNR 21.95 / 23.69 (seq 07/08)
- $R=7, T=1000$: Global PSNR 22.74 / 23.94
- $R=14, T=500$: Global PSNR 22.84 / 24.01
Improvement is most pronounced in the low-round regime. Beyond $R=7$, gains are marginal relative to added communication cost.
Evaluation
Dataset: MeanGreen
9 indoor corridor sequences (03–11), 8,739 total frames, collected by a mobile robot with:
| Sensor | Spec |
|---|---|
| Camera | OAK-D-Pro RGB, $1280 \times 720$, $63°$ FOV |
| LiDAR | Velodyne VLP-16, 16 channels, 10 Hz, 100 m range, $\pm 15°$ vertical FOV |
Scenes feature sparse texture, repeated structures, and a forward-facing camera with limited FOV — challenging for novel-view synthesis.
Metrics
PSNR (↑), SSIM (↑), LPIPS (↓). Two protocols: Local (per-client validation) and Global (aggregated model on union of all validation frames).
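For reference, PSNR here is the standard definition over [0,1]-normalized images, not anything specific to this paper:

```python
import numpy as np

def psnr(I, I_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = ((I - I_hat) ** 2).mean()
    return 10.0 * np.log10(max_val ** 2 / mse)
```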
Sequence-Level Results (Selected)
| Seq | Frames | $K$ | Local PSNR | Global PSNR | Local SSIM | Global SSIM | Local LPIPS | Global LPIPS |
|---|---|---|---|---|---|---|---|---|
| 03 | 1840 | 5 | 22.36 | 18.82 | 0.783 | 0.711 | 0.428 | 0.523 |
| 05 | 543 | 2 | 23.71 | 22.65 | 0.818 | 0.808 | 0.391 | 0.415 |
| 08 | 718 | 2 | 24.52 | 23.94 | 0.853 | 0.832 | 0.374 | 0.388 |
| 10 | 856 | 3 | 25.07 | 22.87 | 0.845 | 0.812 | 0.353 | 0.392 |
| 11 | 552 | 2 | 24.01 | 22.77 | 0.827 | 0.808 | 0.366 | 0.385 |
Key Findings
- Local reconstruction is consistently high: PSNR $>22$ across all sequences. Freezing geometry does not hinder within-client reconstruction.
- Global degradation is concentrated in a small subset of clients, not uniformly distributed. E.g., seq 03 client 1 drops from 19.97 → 15.44 PSNR; seq 09 client 2 drops from 29.28 → 19.53. In contrast, seq 08 client 0: 25.66 → 25.61 (near-perfect preservation).
- On well-behaved sequences (05, 08, 11), the global–local gap is within ~0.5–1.0 dB PSNR.
- Worst-case degradation (seq 03) shows ~3.5 dB drop, attributed to difficult temporal boundaries and short segments.
Reproduction Guide
Code and dataset are not yet publicly released (the paper states they “will be publicly released”). The following approximates the methodology with currently available tools:
1. Environment Setup
```shell
# Prerequisites (assumed)
pip install torch torchvision
# 3DGS codebase (e.g., gsplat or the original 3DGS)
git clone https://github.com/nerfstudio-project/gsplat.git
# LiDAR odometry
pip install kiss-icp
# Monocular depth for pose estimation (package name may differ)
pip install depth-anything-v2
```
2. Pose Construction
```shell
# Step 1: Run KISS-SLAM on LiDAR to get the global anchor trajectory
# (CLI name and flags are illustrative; check the kiss-icp/kiss-slam docs)
kiss_icp_dataset /path/to/lidar --config indoor
# Step 2: Run Depth Anything V2 per client for monocular depth + pose
# (the run_*.py scripts below are placeholders, not released code)
# Each client processes its C=400 frame chunk independently
python run_depth_anything.py --images /path/to/client_k/images \
    --output poses_client_k.npy
# Step 3: Umeyama alignment of each client trajectory to the LiDAR anchor
python align_trajectory.py --client_poses poses_client_k.npy \
    --anchor_poses lidar_trajectory.npy --output aligned_k.npy
# Step 4: Boundary smoothing over a tau-frame window
python smooth_boundaries.py --aligned_poses aligned_k.npy \
    --tau 20 --output smoothed_k.npy
```
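In lieu of the unreleased `smooth_boundaries.py`, Step 4 can be sketched as follows, decoupling rotation (SO(3) exponential map via Rodrigues) from translation (linear blend). This is a simplification of the full SE(3) Exp/Log in the Architecture section, and all helper names are mine:

```python
import numpy as np

def so3_log(R):
    """Rotation matrix -> axis-angle vector (Rodrigues)."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-10:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def so3_exp(w):
    """Axis-angle vector -> rotation matrix (Rodrigues)."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K

def smooth_boundary(poses, dR, dt, tau):
    """Blend a boundary correction (dR, dt) into the first tau poses.
    poses: list of (R, t); beta decays 1 -> 0 over the window."""
    w = so3_log(dR)
    out = []
    for i, (R, t) in enumerate(poses):
        beta = max(0.0, 1.0 - i / tau)
        Rb = so3_exp(beta * w)
        out.append((Rb @ R, Rb @ t + beta * dt))  # decoupled approximation
    return out
```

At the first frame the correction is applied in full; at frame $\tau$ and beyond the poses pass through untouched.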
3. Global Gaussian Initialization
```shell
# Fuse LiDAR point clouds from all clients into a global cloud
python fuse_pointclouds.py --lidar_dirs /path/to/client_*/lidar \
    --output global_cloud.ply
# Initialize ~6e5 Gaussians from the fused cloud:
# centers from point positions, scales from KNN(3), colors from RGB
python init_gaussians.py --cloud global_cloud.ply \
    --images /path/to/all/images --num_gaussians 600000 \
    --output init_gaussians.ply
```
4. Federated Training Loop
```python
# Pseudocode for the federated training loop
# (clients, aggregate, and global_theta_app are placeholders)
for r in range(R):  # R = 7 communication rounds
    # Server broadcasts global appearance params to all clients
    for k in range(K):
        clients[k].receive(global_theta_app)
    # Local optimization (T = 1000 steps each; runs in parallel in practice)
    results = [
        clients[k].optimize(steps=T, loss_lambda=0.2, freeze_positions=True)
        for k in range(K)
    ]  # each entry: (local_theta_app, visibility_counts)
    # Server aggregates with visibility-aware weighting
    global_theta_app = aggregate(
        [theta for theta, _ in results],
        [vis for _, vis in results],
        eps=1e-8,
    )
```
5. Evaluation
```shell
# Evaluate the aggregated model on the union of all validation sets (every 8th frame)
python evaluate.py --gaussians final_gaussians.ply \
    --images /path/to/all/val_images --poses smoothed_poses.npy
```
Notes
- Core insight: 3DGS’s explicit parameter representation (positions, covariances, colors as separable tensors) enables geometry-appearance decoupling that is impossible in implicit methods like NeRF. This is the key architectural advantage over FedNeRF.
- Frozen geometry is the main tradeoff: eliminates geometric drift but prevents the model from correcting residual pose errors after initialization. ADC is disabled, so the fixed scaffold must provide sufficient geometric coverage upfront.
- No comparison baselines yet: The paper lacks direct comparison to centralized 3DGS, Fed3DGS, or other multi-agent reconstruction methods. The authors note “further evaluation against related baselines will help position the method more clearly.”
- Dataset limitation: MeanGreen is indoor corridors only with sparse texture and limited FOV. A more comprehensive version with indoor+outdoor sequences is planned. Public release is pending.
- Partitioning sensitivity: Performance depends heavily on how observations are split across clients, especially at boundaries with large appearance/motion changes. The boundary smoothing ($\tau$-frame window) partially mitigates this but is insufficient for severe cases.
- Communication overhead: Only appearance parameters and visibility counts are transmitted (no images or raw data). With $M = 6 \times 10^5$ Gaussians, each transmitting $\theta_{\text{app}}$ per round, the communication payload is substantial but fixed regardless of image resolution — a favorable scaling property.
- Relevance to multi-agent world modeling: F3DGS demonstrates that photorealistic 3D scene representations can be built collaboratively without raw data exchange, directly applicable to multi-robot mapping and autonomous driving fleet scenarios. The visibility-aware aggregation mechanism is a general technique for handling partial observability in distributed representation learning.