2026-04-05 / technique / 2604.02327
zero-initialized cross-attention gates preserve pretrained representations when injecting new modalities
SteerViT injects text conditioning into pretrained DINOv2 via lightweight cross-attention layers whose gates $\alpha_\ell$ are initialized to 0. at initialization, the gated cross-attention output is zero, so the modified ViT behaves identically to the original. during training, the gates gradually open to admit text influence. this is the same zero-init strategy used in LoRA adapters and in VLMs like Flamingo, but applied here to visual representation steering rather than language generation. the benefit is clear: you don’t need to retrain the visual encoder from scratch, and the original representation quality is preserved for tasks where steering isn’t needed. SteerViT omits Flamingo’s FFN gating (saves 67% of added parameters) while matching or outperforming dedicated approaches on anomaly detection and personalized object discrimination.
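a minimal numpy sketch of the gating mechanism, assuming a Flamingo-style tanh gate (weight names and shapes are illustrative, not SteerViT's actual parameterization):

```python
import numpy as np

def gated_cross_attention(vis_tokens, text_tokens, Wq, Wk, Wv, alpha):
    """Cross-attention from visual queries to text keys/values,
    scaled by a learnable gate alpha that is initialized to 0."""
    Q = vis_tokens @ Wq                      # (N, d)
    K = text_tokens @ Wk                     # (M, d)
    V = text_tokens @ Wv                     # (M, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    # gate: at alpha = 0 the block is an exact identity on vis_tokens
    return vis_tokens + np.tanh(alpha) * (attn @ V)

rng = np.random.default_rng(0)
d = 16
vis = rng.normal(size=(8, d))
txt = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = gated_cross_attention(vis, txt, Wq, Wk, Wv, alpha=0.0)
```

at `alpha=0.0` the output equals the input exactly, which is the whole point: the pretrained representation is untouched until the gate learns to open.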
2026-04-05 / technique / 2604.01860
formulating RL policy improvement as posterior inference prevents catastrophic collapse in generative VLA fine-tuning
POCO reformulates policy improvement in generative policies as a posterior inference problem. instead of directly optimizing policy parameters with RL gradients (which causes catastrophic forgetting in pretrained VLA models), POCO runs an EM procedure: the E-step creates a reward-weighted implicit posterior over action trajectories, and the M-step distills this posterior into the policy using a clipped surrogate objective (borrowing the clipping idea from PPO). this offline-to-online approach anchors exploration to the pretrained prior while allowing targeted improvement, achieving 96.7% real-world success across 4 contact-rich tasks. the key design choice is operating at the chunk level (action sequences, not single steps), which aligns naturally with flow-matching VLA architectures like $\pi_{0.5}$.
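the E/M split can be sketched at the batch level. this is a schematic under assumed notation (per-chunk log-probs and scalar chunk rewards; the real objective operates on flow-matching trajectories, not exact likelihoods):

```python
import numpy as np

def em_policy_update(log_p_new, log_p_old, rewards, beta=1.0, eps=0.2):
    """One EM-style update over a batch of sampled action chunks.
    E-step: reward-weighted implicit posterior over the sampled chunks.
    M-step: clipped surrogate (PPO-style) distilling the posterior."""
    # E-step: posterior weights q(tau) proportional to exp(R(tau)/beta),
    # self-normalized over the batch
    w = np.exp(rewards / beta)
    w = w / w.sum()
    # M-step: maximize E_q[log p_new] via a clipped probability ratio,
    # which keeps the policy anchored near the pretrained prior
    ratio = np.exp(log_p_new - log_p_old)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    surrogate = np.minimum(ratio * w, clipped * w)
    return -surrogate.sum()   # loss to minimize
```

when the new and old policies coincide the ratio is 1 everywhere and the loss reduces to the (negative) total posterior mass; large ratio excursions are cut off by the clip, which is what prevents the collapse that raw RL gradients cause.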
2026-04-05 / technique / 2604.01570
Gaussian action neighborhood prior shapes VLA output distribution for 2.5x faster convergence
the Feasible Action Neighborhood (FAN) prior addresses a fundamental mismatch: VLA models are trained with language-style one-hot supervision, but physical manipulation admits a neighborhood of near-equivalent actions per state. the FAN regularizer is a Gaussian KL divergence term $\mathcal{L}_{\text{FAN}} = D_{\text{KL}}\left(q \,\|\, \mathcal{N}(\mu_{\text{ref}}, \sigma^2 I)\right)$ that shapes the predicted action distribution from “spiky” overconfident peaks to smooth, locally unimodal predictions around the demonstrated direction. this is closely related to the gaussian prior collapse prevention technique in JEPAs (LeWM) – in both cases, a simple Gaussian prior on the latent/action space regularizes the model away from degenerate solutions. the FAN approach yields +11.7% in-distribution and +6.2% OOD improvement on ManiSkill, plus 2.5x faster RFT convergence.
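the KL term has a closed form when $q$ is a diagonal Gaussian. a sketch (the diagonal-posterior assumption and the reference std `sigma` are mine, not necessarily the paper's exact parameterization):

```python
import numpy as np

def fan_kl(mu_q, sigma_q, mu_ref, sigma=0.1):
    """Closed-form KL( N(mu_q, diag(sigma_q^2)) || N(mu_ref, sigma^2 I) ).
    Penalizes both overconfident (tiny sigma_q) and off-center predictions,
    smoothing the action distribution around the demonstrated direction."""
    return np.sum(
        np.log(sigma / sigma_q)
        + (sigma_q**2 + (mu_q - mu_ref) ** 2) / (2 * sigma**2)
        - 0.5
    )
```

the loss is zero exactly when the prediction matches the reference neighborhood (same mean, same width), and grows as the predicted distribution becomes a spike or drifts off the demonstrated direction.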
2026-04-05 / pattern / 2603.14498 / 2604.01681 / 2604.01577
fast-slow decomposition pattern emerges across inference, planning, and world modeling
three very different systems converge on the same architectural insight. R3DP uses asynchronous fast-slow collaboration (AFSC) for 3D manipulation inference: a frozen VGGT runs every $\tau=8$ frames to produce 3D-aware features, while a lightweight TFPNet (DINOv2-S + 4 alternating-attention blocks) propagates to intermediate frames. this gives 44.8% latency reduction (40.3ms vs 73.1ms) with only 3.3pp accuracy loss. agentic fast-slow planning (AFSP) decomposes autonomous driving into slow LLM reasoning (4.13s) for scene understanding and fast MPC (10Hz) for trajectory tracking, reducing lateral deviation by 45%. FSRM (thinking while listening) interleaves fast recurrent latent updates with slow observation updates for long-horizon sequential modeling, achieving ~60% OOD accuracy on maze tasks vs 20-30% for baselines. the common thread: systems operating over multiple timescales benefit from explicit separation of fast reactive and slow deliberative paths. none use uniform processing – they all route through differently-sized modules.
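stripped of the per-paper details, the shared scheduling skeleton is tiny (function names are placeholders for the slow and fast modules):

```python
def fast_slow_rollout(frames, slow_fn, fast_fn, tau=8):
    """Run the expensive slow module every tau frames and propagate its
    output with the cheap fast module in between (R3DP-style AFSC)."""
    outputs, slow_state = [], None
    for t, frame in enumerate(frames):
        if t % tau == 0:
            slow_state = slow_fn(frame)             # e.g. frozen VGGT features
        outputs.append(fast_fn(frame, slow_state))  # e.g. TFPNet propagation
    return outputs
```

the same skeleton covers AFSP (slow LLM plan, fast MPC tracking) and FSRM (slow observation updates, fast latent updates) by swapping what `slow_fn` and `fast_fn` compute and what `tau` means (frames, planning cycles, or latent steps).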
2026-04-05 / technique / 2604.01567
anchored diffusion with truncated schedule reduces VLA inference cost by 4-8x via residual drift correction
AnchorVLA shows diffusion policy heads for VLA models don’t need full iterative denoising. by anchoring to $K$-means cluster centers (anchor trajectories) and using only $S_{\text{tr}} = 10$ steps instead of 50-100, they cut compute by 4-8x while matching or exceeding full-step diffusion on mobile manipulation (64.0% avg success, +8.4% over AC-DiT on ManiSkill-HAB). the key insight: when the noisy sample starts near the action manifold, fewer steps recover a valid action. a residual correction module (57K params) decouples drift correction from anchor selection. it predicts per-step adjustments $\Delta a_t$ conditioned on $a_{t-1}$, $o_t$, and $a_{\text{anchor}}$, running at full control frequency (50Hz) while the diffusion head fires every $H$ steps. this mirrors the fast-slow pattern in R3DP and FSRM: slow deliberation (anchor generation) paired with fast reactive correction (residual adjustments). total system: 726.25M params, 128.3 TFLOP/episode at $H=5$ vs 641.4 for full-step diffusion.
2026-04-04 / technique / 2604.02292
clipped linear softmax approximation enables 8.7-15.1x speedup on integer-native hardware
HCCS (Head-Calibrated Clipped-Linear Softmax) replaces the exponential+normalization in standard softmax with a clipped linear surrogate: $s_i = B_h - S_h \cdot \delta_i$, using only 3 integer parameters per attention head ($B_h$, $S_h$, and the clipping threshold). on AMD Versal AI Engines (integer-only DSPs), this achieves 8.7-15.1x throughput speedup for the attention block while maintaining within 0.3-1.9 percentage points of float32 accuracy on BERT-tiny/small. the calibration is done via grid search on 64 samples minimizing int16 KL divergence, requiring no gradient computation. multi-tile scaling reaches 407 G operations/s at 184 tiles. this is relevant specifically to FPGAs and ASICs without floating-point units.
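a float sketch of the surrogate (on the actual hardware $B_h$ and $S_h$ are per-head integers found by the grid search; here they are hand-picked):

```python
import numpy as np

def hccs_softmax(x, B, S):
    """Clipped-linear softmax surrogate (integer-friendly).
    delta_i is the gap to the row max; scores decay linearly and clip
    to zero, replacing exp() with a multiply, subtract, and clamp."""
    delta = x.max(axis=-1, keepdims=True) - x      # delta_i >= 0
    s = np.maximum(B - S * delta, 0)               # clipped linear score
    return s / s.sum(axis=-1, keepdims=True)       # cheap normalization

x = np.array([[2.0, 1.0, 0.0, -5.0]])
p = hccs_softmax(x, B=8, S=4)
```

like softmax, the row max gets the highest weight and the output rows sum to 1; unlike softmax, entries far from the max are exactly zero, which is what makes the integer pipeline viable.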
2026-04-04 / observation / 2604.02051
input-conditioned LoRA closes only 51.3% of the depth gap in recursive transformers
Ouroboros uses a compact hypernetwork (0.7M params) to generate per-step LoRA adaptations for a shared weight block applied recursively, but only closes 51.3% of the performance gap between a depth-1 and depth-N transformer. the remaining gap comes primarily from initialization distribution mismatch between the weight-generation training phase and the held-out depth configurations. this places a concrete upper bound on how much weight generation can compensate for reduced depth. the 9.2M trainable parameters (0.6% of total) achieve 43.4% training loss reduction but the generalization gap to unseen depths remains substantial.
2026-04-04 / technique / 2604.01765
3D geometry grounding via causal flow-matching bridges appearance modeling to reliable action planning
DriveDreamer-Policy chains three flow-matching diffusion experts (depth, video, action) on a shared Qwen3-VL-2B backbone via a causal query interface: 64 depth tokens condition 64 video tokens, which condition 8 action tokens. each expert can be independently initialized from pretrained models (depth from PPD, video from Wan-2.1-T2V) while sharing the vision-language backbone. the causal ordering matters: depth is predicted first because it provides geometric grounding, then video for temporal dynamics, then action for control. this is different from discrete token unification (like the unified token space VLAs tracked in another note) because it uses continuous flow-matching for each modality while sharing representations through cross-attention. the key finding: adding 3D depth prediction as an intermediate representation between video generation and action planning consistently improves planning metrics. depth provides geometric structure that 2D appearance alone cannot: spatial relationships between objects, ego-vehicle distance, and obstacle geometry are all more naturally encoded in depth than in RGB. removing the depth expert degrades planning scores significantly. this suggests that purely appearance-based world models may hit a ceiling for manipulation tasks where spatial reasoning matters more than visual fidelity.
2026-04-04 / observation / 2604.01985
verifying candidate actions via inverse dynamics is fundamentally easier than predicting all forward outcomes
the World Action Verifier (WAV) exploits a key asymmetry: for a world model to be useful in search-based planning, it needs to be reliable over a much broader distribution of suboptimal actions than any policy would ever take. but predicting accurate outcomes for arbitrary actions (including bad ones) is hard. the inverse problem – given a state transition, recovering the action that caused it – is sparser and easier to learn. WAV uses this by running a reverse cycle: generate candidate subgoals, use a sparse inverse dynamics model to propose actions, then verify through the forward world model. this achieves 2x sample efficiency and 18% downstream policy improvement. the implication is that inverse dynamics models may be underexploited as cheap verifiers in world-model-based planning.
2026-04-03 / observation / 2603.22078
WAMs dominate visual perturbation robustness but trail VLAs on camera and robot perturbations
in a systematic robustness comparison across 7 perturbation dimensions (LIBERO-Plus, RoboTwin 2.0-Plus), World Action Models consistently outperform VLAs on visual perturbations: noise (Cosmos-Policy 92.7% vs $\pi_{0.5}$ 89.7%), lighting (LingBot-VA 89.0% vs $\pi_{0.5}$ 49.6% on RoboTwin), layout (LingBot-VA 87.9% vs $\pi_{0.5}$ 56.8%), background (LingBot-VA 91.3% vs $\pi_{0.5}$ 71.7%). but WAMs are weak on camera viewpoint changes (LingBot-VA 28.9% vs $\pi_{0.5}$ 45.6% on RoboTwin) and robot initial state perturbations (LingBot-VA 36.2% vs X-VLA 65.2%). the spatiotemporal video priors from web-scale pre-training help with visual diversity but not with embodiment-specific perturbations.
2026-04-03 / observation / 2603.22078
WAM inference latency is 4.8-83x slower than VLA inference
on the same hardware, $\pi_{0.5}$ generates a 50-step action chunk in 63ms. the fastest WAM (GE-Act, 36-step chunk) takes 300ms (4.8$\times$). LingBot-VA on RoboTwin takes 5230ms per 32-step chunk (83$\times$). the bottleneck is visual state denoising steps. Cosmos-Policy reduces chunk size to 16 to partially compensate (390ms, 6.2$\times$) but sacrifices action horizon. MOTUS with its optical-flow latent actions needs 1175ms for 16 steps (18.6$\times$). this latency gap is a deployment blocker for real-time robot control at 10-50Hz.
2026-04-03 / technique / 2604.01001
universal keypoint action representation enables cross-embodiment world model transfer
EgoSim represents actions as 21-keypoint MANO hand skeletons for humans, mapped to simplified thumb + index finger skeleton with gripper opening state for robots. this universal keypoint representation enables cross-embodiment transfer with only 200 finetuning steps on 50K AgiBot-World clips: PSNR jumps from 15.180 (no hand pretrain) to 18.670 (with hand pretrain), a +3.490 improvement. the implication is that the action representation space is more transferable across embodiments than the visual appearance space – the same skeleton joints that describe human grasping can describe robot end-effector control after minimal adaptation.
2026-04-03 / technique / 2604.01001
3D state persistence via TSDF fusion enables multi-clip world model simulation
EgoSim maintains a persistent 3D scene state across simulation clips using TSDF (Truncated Signed Distance Function) fusion with Sim3 Umeyama alignment. after generating each clip, it reconstructs the 3D state from the generated video using off-the-shelf depth estimation (DepthAnything3), SLAM (DROID-SLAM), and segmentation (SAM3 + Grounding-DINO). the state updating module is entirely training-free – it’s a pipeline of existing perception modules. this enables closed-loop simulation where objects that move in clip $k$ stay moved in clip $k+1$. the TSDF voxel size is 0.003m with 3.0m max depth truncation. statistical outlier removal uses 20 neighbors with std dev ratio 2.0. key limitation: relies on monocular depth estimation which accumulates error across clips, causing PSNR degradation from 25.056 (single clip) to 19.165 (continuous).
2026-04-02 / technique / 2603.29535 / 2604.02051 / 2604.02215
treating LoRA weights as runtime inputs to a single compiled graph eliminates edge NPU graph swap overhead
QUAD from Samsung Research treats LoRA weights as runtime inputs to a single frozen compiled graph rather than baking them into separate binaries. on mobile NPUs (Qualcomm, Exynos, MediaTek), swapping between 10+ LoRA-adapted vision models costs 1.5s per switch due to graph recompilation. QUAD’s approach eliminates this entirely: the base model graph is compiled once, and LoRA weights are injected at runtime. a quantization sensitivity score (QSS) determines the anchor LoRA whose quantization profile is shared across all adapters, with knowledge distillation fine-tuning the rest. W8A16 is optimal (W8A8 causes catastrophic FID collapse to ~600). tested on Galaxy S25 and Tab S11 with real on-device latency numbers (~1.0-1.9s per task). the key insight for embedded deployment: compile-time specialization vs runtime flexibility is a real engineering bottleneck on NPUs, not just a theoretical concern.
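the core idea in one line: the adapter matrices arrive as runtime tensors, never as compile-time constants (shapes and the scaling convention below are illustrative):

```python
import numpy as np

def adapted_forward(x, W_base, lora_A, lora_B, scale=1.0):
    """One linear layer of the frozen compiled graph. W_base is a
    compile-time constant; lora_A / lora_B are runtime *inputs*, so
    switching adapters means feeding different tensors -- no recompile."""
    return x @ W_base + scale * (x @ lora_A) @ lora_B

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 32))
W = rng.normal(size=(32, 32))
A1, B1 = rng.normal(size=(32, 4)), rng.normal(size=(4, 32))
A2, B2 = rng.normal(size=(32, 4)), rng.normal(size=(4, 32))
y1 = adapted_forward(x, W, A1, B1)   # "task 1"
y2 = adapted_forward(x, W, A2, B2)   # "task 2" -- same graph, new inputs
```

the low-rank product is computed as `(x @ A) @ B` rather than `x @ (A @ B)` so the runtime cost stays proportional to the rank, which is what keeps adapter switching cheap on the NPU.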
2026-04-02 / observation / 2603.29844
latent intent bottleneck decouples planning from execution in VLA
DIAL introduces a dual-system VLA where System-2 (Qwen2.5-VL-3B) performs latent world modeling – predicting future ViT feature representations at the intent horizon – while System-1 (self-attention + DiT with flow matching) acts as a fast inverse dynamics model that converts intents to action chunks. the differentiable intent bottleneck $z_{\text{intent}}$ is the key: it forces the VLM to compress its reasoning about future states into a compact latent that the fast policy decoder can execute from. this is architecturally different from approaches that generate full future frames (MMaDA-VLA) or reason in language space. the two-stage training (decoupled warmup then end-to-end) is critical – removing warmup causes a 20-point OOD drop, suggesting the latent intent space needs to stabilize before the policy decoder can learn from it.
2026-04-02 / observation / 2603.29078 / 2603.27914
hadamard rotation accounts for 98% of quantization quality improvement at 5-bit
PolarQuant shows that normalizing weight blocks to the unit hypersphere and applying Walsh-Hadamard rotation (block size 128) accounts for 98% of the quality improvement at Q5 (PPL 6.90 to 6.40 on Qwen3.5-9B). the Lloyd-Max optimal centroids for $\mathcal{N}(0,1)$ contribute only 2%. ITQ3_S independently confirms this at 3-bit: pre-rotating weights with a 256-point FWHT before ternary quantization reduces the perplexity gap by 57% vs plain ternary (IQ3_S). the normalized Hadamard matrix $H/\sqrt{n}$ is its own inverse, so dequantization is trivial and requires no additional storage. two independent papers, different bit widths, same conclusion: rotation is the dominant factor, not the quantization centroids. this means practical systems should prioritize rotation preprocessing over complex codebook design.
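the rotation half of the recipe, sketched with an unnormalized FWHT (the per-block scale rule here is plain absmax round-to-nearest, deliberately simpler than either paper's codebook):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be
    a power of two). Unnormalized: applying it twice multiplies by n,
    so H/sqrt(n) is self-inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x

def rotate_quantize_dequantize(w, bits=5):
    """Rotate a weight block, round-to-nearest quantize, rotate back."""
    n = w.shape[-1]
    r = fwht(w) / np.sqrt(n)                 # orthonormal rotation
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    q = np.round(r / scale) * scale          # uniform RTN quantization
    return fwht(q) / np.sqrt(n)              # H/sqrt(n) is self-inverse
```

the rotation spreads outliers across the block so a uniform grid fits the rotated values well, and because the inverse transform is the same butterfly, dequantization stores nothing extra.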
2026-04-02 / technique / 2603.29409
proprioceptive-semantic cross-modal foresight dramatically improves manipulation planning
CLaD uses asymmetric cross-attention where proprioceptive state queries attend to semantic visual features to produce grounded latent foresights of future states. the asymmetry matters: proprio $\to$ semantic achieves 94.7% on LIBERO-LONG, while semantic $\to$ proprio gets 93.8% and symmetric gets 86.7%. proprioceptive-only foresight catastrophically degrades to 50.4%. the insight is that the robot’s own state (joint positions, gripper) is the right query space for retrieving what matters from visual observations. the foresights are then injected into a diffusion policy via FiLM modulation, achieving 25 Hz inference on 4 GB memory with only 0.66B parameters. this is a concrete, deployable architecture pattern: lightweight latent predictions conditioned on proprioception, not full video generation.
2026-04-01 / technique / 2602.04037
lagged cross-episode context separates static domain info from time-varying dynamics without labels
DADP’s lagged context dynamical prediction uses temporal offset $\Delta t$ to disentangle static domain properties (friction, gravity, mass) from time-varying dynamical properties (higher-order temporal derivatives). by selecting context from a different episode in the same domain (cross-episode prediction with $\Delta t \to \infty$), time-varying information is information-theoretically eliminated while static domain info is preserved. a simple transformer context encoder (dim 256, 4 layers, 8 heads) trained with forward + inverse dynamics prediction reaches 99.3% linear probe accuracy on Walker2d – comparable to supervised oracle (99.8%). this is an unsupervised representation learning technique that requires no domain labels, only the implicit signal that episodes from the same domain share static properties.
2026-03-31 / observation / 2603.26360
VLA deployment speed is bottlenecked by motion planning, not neural inference
Realtime-VLA V2 reveals that even after optimizing VLA GPU inference to run fast, the dominant bottleneck in real robot deployment is motion planning: constant-speed waypoint execution, unoptimized acceleration profiles, and uncompensated mechanical lag (150ms) make the unoptimized VLA 2-3x slower than a human operator. the fix is entirely systems-level: QP-based temporal optimization for smooth acceleration, acados MPC for spatial pre-amplification of commands. this is orthogonal to model-level optimizations (MMaDA-VLA, Fast-dVLA, DFM-VLA). the paper introduces a useful mental model: roofline analysis classifies trajectory segments as “motion-bounded” (robot can’t move faster) vs “control-bounded” (neural inference can’t keep up). for embedded robots like bopi with limited compute, most segments will be control-bounded, making model-level speed optimizations even more critical.
2026-03-31 / technique / skywork-matrix-game-3
unified self-attention memory with error buffer training and GPU retrieval enables real-time video world models
Matrix-Game 3.0 combines three techniques for long-horizon video world models: (1) unified self-attention memory: putting memory latents, history latents, and current prediction latents into the same self-attention space inside a single DiT backbone works better than MoC-style sparse routing or cross-attention memory injection. memory features evolve together with prediction features rather than being injected from a separate branch. camera-aware selection (retrieving by frustum overlap) and Plücker-style relative geometry encoding handle the “which memory to use” problem. head-wise perturbed RoPE bases prevent periodic positional aliasing between distant memory and current frames. (2) error buffer training (from SVI): maintaining a buffer of prediction residuals $\delta = \hat{x}_i - x_i$ and injecting sampled errors into conditioning latents during training ($\tilde{x}_i = x_i + \gamma\delta$). the model learns self-correction and becomes robust to the imperfect contexts it encounters during autoregressive inference. applies to any iterative generation pipeline where error accumulation is a problem, including robotics world models. (3) GPU-accelerated retrieval: in acceleration ablations, removing GPU-based memory retrieval dropped FPS from 40 to 6.6 (33.4 point drop), far exceeding the impact of INT8 quantization (12.6 points) or pruned VAE (14.2 points). the candidate set grows linearly with rollout length. this suggests that for any memory-augmented real-time generation system, the retrieval step needs dedicated hardware acceleration.
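technique (2) in isolation, since it is the most portable of the three. a sketch of the residual buffer (the class name, capacity, and FIFO eviction are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

class ErrorBuffer:
    """Buffer of prediction residuals delta = x_hat - x. During training,
    sampled residuals are mixed into conditioning frames so the model sees
    the kind of imperfect context it will produce autoregressively."""
    def __init__(self, capacity=1000):
        self.buf, self.capacity = [], capacity

    def push(self, x_hat, x):
        self.buf.append(x_hat - x)
        if len(self.buf) > self.capacity:
            self.buf.pop(0)               # FIFO eviction

    def corrupt(self, x, gamma=0.5):
        if not self.buf:
            return x
        delta = self.buf[rng.integers(len(self.buf))]
        return x + gamma * delta          # x_tilde = x + gamma * delta
```

during training you `push` each prediction/target pair and `corrupt` the clean conditioning frames before the forward pass; at inference the buffer is unused.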
2026-03-31 / technique / prismml-bonsai-1bit-8b
Q1_0_g128 format: 1-bit bitpacked weights with FP16 group scaling at 1.125 bits per weight
the Q1_0_g128 format stores each weight as a single sign bit in $\{0, 1\}$, bitpacked at 1 bit per weight, with one shared FP16 scale per group of 128 weights. effective weight: $w_i = s_g \cdot (2b_i - 1)$. storage cost: $1 + 16/128 = 1.125$ bits/weight. the key design choice is inline dequantization inside matmul kernels – sign bits are decoded during the matrix multiplication rather than materializing a full FP16 tensor first. this preserves the bandwidth advantage in the decoding loop where it matters most. for MLX, the format costs 1.25 bits/weight because MLX requires both scale and bias per group ($w = s_{\text{mlx}} \times b_i + b_{\text{mlx}}$), packed as $s_{\text{mlx}} = 2s_g$, $b_{\text{mlx}} = -s_g$.
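a numpy sketch of pack/unpack (the scale rule — mean absolute value per group — is an assumption; the format itself only fixes the storage layout):

```python
import numpy as np

GROUP = 128

def quantize_q1(w):
    """Pack a (n_groups * 128,) weight vector: 1 sign bit per weight plus
    one FP16 scale per group of 128 -> 1.125 bits/weight."""
    w = w.reshape(-1, GROUP)
    # assumed scale rule: mean |w| per group (format only mandates FP16)
    scale = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    bits = (w >= 0).astype(np.uint8)          # b_i in {0, 1}
    packed = np.packbits(bits, axis=1)        # 16 bytes per group
    return packed, scale

def dequantize_q1(packed, scale):
    bits = np.unpackbits(packed, axis=1, count=GROUP).astype(np.float32)
    return scale.astype(np.float32) * (2 * bits - 1)   # w_i = s_g * (2 b_i - 1)
```

a real kernel would apply `dequantize_q1` per tile inside the matmul loop rather than materializing the full tensor, which is the inline-dequantization point above.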
2026-03-31 / observation / 2603.26599
latent-space reward computation avoids RGB decoding artifacts in video models
VGGRPO demonstrates that computing geometry rewards directly in video diffusion latent space (via a stitched geometry model) produces better results than RGB-space rewards, while being 24.5% faster. critically, RGB-based reward methods (epipolar-DPO, VideoGPA) actually degrade image quality (0.635 vs 0.673 baseline on VBench) while improving geometry, because VAE decoding introduces artifacts that confuse the reward model. the latent approach avoids this entirely and improves both geometry and image quality simultaneously. the technique is “model stitching”: a single 3D conv layer maps VAE latents into the intermediate feature space of a pretrained geometry model (Any4D), enabling depth, camera pose, 3D points, and scene flow prediction without ever decoding to pixels. for robotics world models, this opens the possibility of training with dense geometric supervision entirely in latent space.
2026-03-31 / observation / anthropic-emotions-2026
emotion representations in transformers are locally scoped, causally influence alignment, and are modulated by post-training
in Claude Sonnet 4.5, emotion concept representations do not persistently track a character’s emotional state. instead, each token position computes what emotion is operative at that point. early layers encode emotional connotations of the current word/phrase (“sensory”). middle-late layers encode the emotion relevant to predicting upcoming tokens (“action”). the model tracks emotional states across time not through persistent activity, but through attention recalling previously computed emotion representations. critically, these representations causally influence behavior: the desperation emotion vector’s activation drives blackmail (when facing shutdown threat) and reward hacking (when repeatedly failing software tests). positive emotion vectors (happy, loving) increase sycophancy; suppressing them increases harshness. post-training specifically reduces high-arousal emotion vectors (desperation, spiteful) and increases low-arousal negative vectors (brooding, reflective, gloomy), presumably as a deliberate alignment intervention. this suggests that emotion representations are not just epiphenomena of pretraining on human text but are actively recruited by the model to guide agentic behavior, and that post-training can modulate them.
2026-03-31 / technique / 2603.26425
CPU-optimized vision backbones need grouped convolutions and small kernels, not depthwise convolutions
CPUBone shows that on CPUs with limited parallelism (4-8 cores), depthwise convolutions are deceptively inefficient: they have low MACs but terrible MACpS (MACs-per-second) due to poor hardware utilization. the effective metric on CPUs is MACs / MACpS = latency. two design rules emerge: (1) grouped convolutions with groups=2 halve the MACs of the expansion conv while maintaining ~95% of MACpS, and (2) 2x2 kernels reduce convolution MACs by ~56% while giving ~42% higher MACpS for depthwise convolutions on ARM. the result: CPUBone-B0 (5.4M params) achieves 78.7% top-1 at 42.3ms on raspberry pi 5 CPU, 3.7x faster than MobileNetV3-Large at higher accuracy. for robotics on MCUs where even 4 cores is a lot, these grouped conv + small kernel design rules are directly applicable to custom perception architectures. the key caveat: all benchmarks are on ARM application processors, not actual MCUs (STM32, ESP32) – verification needed at the lower end.
2026-03-31 / technique / anthropic-emotions-2026 / 2604.02327
contrastive activation averaging extracts linear concept vectors from generative models
a general recipe for extracting linear representations of concepts from LLMs: (1) generate labeled data where the concept is clearly present (e.g. 1,200 stories per emotion), (2) extract residual stream activations at each layer, averaging across token positions past the point where the concept is established, (3) compute the contrastive vector by subtracting the mean activation across all concepts from the per-concept mean. this isolates what is specific to each concept versus generic content. (4) denoise by computing top PCs on neutral data and projecting them out. the resulting vectors project onto activations at inference time to measure concept activation. validated via logit lens (project through unembed to see which tokens are upweighted), contextual activation patterns (sweep over real documents), and causal steering. applied to 171 emotion concepts in Claude Sonnet 4.5 but the method generalizes to any concept where you can generate labeled examples.
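the recipe condenses to a few numpy lines. shapes are simplified: step (2)'s token-position averaging is assumed already done upstream, so each concept arrives as a (n_samples, d) matrix of pooled activations:

```python
import numpy as np

def contrastive_concept_vectors(acts, n_denoise_pcs=2, neutral=None):
    """acts: dict concept -> (n_samples, d) pooled residual activations.
    Returns unit-norm contrastive vectors: per-concept mean minus the
    grand mean across concepts, with top neutral-data PCs projected out."""
    means = {c: a.mean(axis=0) for c, a in acts.items()}
    grand = np.mean(list(means.values()), axis=0)
    vecs = {c: m - grand for c, m in means.items()}
    if neutral is not None:
        # denoise: remove directions dominated by generic content
        centered = neutral - neutral.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        P = Vt[:n_denoise_pcs]                   # top principal directions
        vecs = {c: v - P.T @ (P @ v) for c, v in vecs.items()}
    return {c: v / np.linalg.norm(v) for c, v in vecs.items()}
```

at inference, projecting activations onto a returned vector gives the concept-activation score that the logit-lens and steering validations then check.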
2026-03-31 / observation / prismml-bonsai-1bit-8b
1-bit quantization quality cliff is uneven across capabilities
1-bit Bonsai 8B shows highly uneven quality regression vs its FP16 base (Qwen3-8B). GSM8K drops 73.2 points, MuSR drops 25.0, IFEval drops 31.5 – but MATH-500 stays flat (+3.6), MBPP+ is unchanged (+0.3), and IFBench actually improves (+52.6, though this may be an evaluation artifact). the paper describes this as “qualitative rather than gradual” failure – models stay fluent but become less dependable on multi-step reasoning, tool use, and edge cases. this unevenness suggests that 1-bit quantization doesn’t uniformly degrade representation quality but rather disrupts specific computational pathways. for deployment, this means you can’t assume “roughly X% worse across the board” – you need per-task evaluation.
2026-03-31 / observation / prismml-bonsai-1bit-8b
energy-per-token savings from 1-bit come from faster completion, not lower power
1-bit Bonsai 8B draws equal or higher instantaneous power during generation than FP16 (31.9W vs 23.3W on M4 Pro with MLX). the energy savings (5.6x lower mWh/token) come entirely from completing token generation much faster, not from drawing less power. inline dequantization inside matmul kernels shifts execution toward a more compute-intensive regime. this is an important nuance that many efficiency papers obscure – lower energy per token does not imply lower power draw. the implication: for thermal-constrained devices where peak power matters (phones, embedded), 1-bit may not help with thermal throttling even though it helps with total energy consumption.
2026-03-30 / technique / 2603.25038 / 2603.25725 / 2603.18532
synthetic data from 3DGS and simulation bridges the data gap for aerial and deformable robot manipulation
AirVLA uses Gaussian Splatting reconstructions with a drone dynamics model to synthesize navigation trajectories, achieving 100% gate success (vs 50% without synthetic data) from only 50 generated examples. SoftMimicGen uses non-rigid registration to generate thousands of deformable manipulation demonstrations from 1-10 source demos, enabling zero-shot sim-to-real transfer. both approaches share the same principle: a small set of real demonstrations seeds a synthetic data pipeline that produces orders of magnitude more diverse training data. the key enabler in both cases is handling the “non-rigid” aspects – AirVLA segments and composites the gripper onto clean scene renders, while SoftMimicGen uses non-rigid registration to adapt trajectories to different deformable object states. generative sim-to-real RL (2603.18532) extends this to VLA fine-tuning: GPT-4o + EmbodiedGen generates 100 interactive 3D environments for ManiSkill 3, enabling RL fine-tuning of $\pi_0$ across diverse scenes. 85% scene acceptance rate with automated QA. a critical finding: scene distribution breadth matters more than RL algorithm for generalization. PPOFlow with 100 generated scenes achieves 79.8% sim success, while the same algorithm with 3 manually designed scenes gets only 36.0% on the same generated test set (60.7pp gap from overfitting). scaling from $N=1$ to $N=50$ gives +24.7pp OOD improvement (53.2% to 77.9%). the manually designed scenes achieve 96.7% on themselves but only 36.0% on generated scenes. 5 days on 8x RTX 6000 Ada is sufficient for the RL itself; the bottleneck is generating enough diverse scenes. the unifying pattern: generative 3D content creation (3DGS, procedural generation, LLM-driven scene design) is becoming the standard way to scale robot learning data beyond what manual collection can provide, and distribution diversity dominates over algorithmic sophistication.
2026-03-30 / pattern / 2603.25038 / 2603.24806 / 2603.26599 / 2602.04037 / 2604.01567
inference-time guidance is becoming a standard pattern for bridging generalist policies and physical constraints
AirVLA injects a physics-aware gradient correction into $\pi_{0}$’s flow-matching sampler at inference time to compensate for payload dynamics on a drone, without retraining. FODMP distills a multi-step diffusion policy into a one-step consistency model, achieving real-time speed while preserving temporal motion structure. in both cases, the solution doesn’t require architectural changes to the base policy – it works by modifying the sampling/generation process at deployment. UMI-on-Air and Fast-dVLA’s real-time chunking follow the same pattern. VGGRPO (2603.26599) extends this to video generation: computing geometry rewards in latent space and applying $\nabla_{z_t} r(z_t)$ guidance every 20 denoising steps, entirely training-free. the latent-space approach avoids RGB decoding artifacts that degrade image quality in RGB-based reward methods. Realtime-VLA V2 (2603.26360) applies the same principle to robot deployment: MPC spatial optimization and QP temporal optimization wrap around any VLA as inference-time systems interventions. DADP (2602.04037) shows another variant: biasing the diffusion prior distribution with learned domain representations $z$ (mixed gaussian prior $x_K = z + \varepsilon$) and reformulating the prediction target to inject domain-awareness into the denoising process. this achieves SOTA domain adaptation without any architectural changes to the diffusion backbone. the recurring theme: inference-time interventions (guidance signals, prior biasing, consistency distillation, chunked inpainting, MPC wrapping) are emerging as the preferred way to adapt generalist models to physical constraints, rather than fine-tuning from scratch.
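the whole family reduces to one move: modify the sampler, not the weights. a toy sketch of the reward-gradient variant (identity "denoiser", quadratic reward; the every-20-steps cadence mirrors VGGRPO's schedule, everything else is illustrative):

```python
import numpy as np

def guided_step(z, denoise_fn, reward_grad_fn, eta=0.1, guide=True):
    """One sampler step with training-free guidance: nudge the latent
    along the reward gradient, then apply the ordinary denoising update."""
    if guide:
        z = z + eta * reward_grad_fn(z)      # z_t <- z_t + eta * grad r(z_t)
    return denoise_fn(z)

# toy demo: reward r(z) = -||z - target||^2, denoiser = identity
target = np.array([1.0, -2.0])
reward_grad = lambda z: 2.0 * (target - z)
z = np.zeros(2)
for step in range(50):
    z = guided_step(z, lambda v: v, reward_grad,
                    guide=(step % 20 == 0))  # guide every 20 steps
```

the base model (`denoise_fn`) is a black box throughout, which is exactly why these interventions compose with any pretrained policy or generator.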
2026-03-29 / technique / 2603.25406 / 2603.25661 / 2603.26320
embedding language, images, and robot controls into one discrete token space enables joint generation
MMaDA-VLA maps language, images, and continuous robot controls into a single discrete token space, then trains one backbone with masked token denoising. the model jointly generates a future goal observation and an action chunk in parallel via iterative denoising. this eliminates the need for separate world models in VLA pipelines - the model predicts what it should see alongside what it should do. iterative refinement allows order-free correction of both vision and action tokens. achieves 98% on LIBERO. Fast-dVLA extends this by making discrete-diffusion VLAs run at real-time speed (30 Hz) through block-wise causal attention and diffusion forcing, proving the discrete token approach is not just accurate but also practical for physical robot control. DFM-VLA (2603.26320) pushes this further by replacing discrete diffusion with discrete flow matching, which enables full-sequence token revision at every step – solving the “irreversible commitment” problem that limits both autoregressive and discrete diffusion decoding. achieves 95.7% on LIBERO average with 70.8% on real bimanual tasks.
2026-03-29 / observation / 2603.19312 / 2604.02292 / 2604.01570
gaussian prior prevents collapse in lightweight JEPAs with simpler training recipes
LeWM avoids representation collapse in JEPAs with just a Gaussian prior regularizer instead of complex multi-term losses, EMAs, pretrained encoders, or auxiliary supervision. reducing tunable loss hyperparameters from six to one directly improves reproducibility. this fits a broader trend toward simpler loss functions in joint embedding architectures, mirroring the shift away from baroque training recipes in self-supervised learning. the result is a ~15M param model, trainable on single GPU in a few hours, that plans 48x faster than foundation-model world models while staying competitive on control tasks. this is a clean, potentially generalizable approach worth testing beyond LeWM’s setup.
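what "just a Gaussian prior" can look like in practice — a moment-matching sketch with a single implicit hyperparameter (its overall weight); LeWM's exact loss may differ:

```python
import numpy as np

def gaussian_prior_reg(z):
    """Anti-collapse regularizer: push a batch of embeddings z (n, d)
    toward N(0, I) by matching the first two moments. A collapsed batch
    (all embeddings equal) has zero covariance and is heavily penalized;
    a standard-normal batch scores near zero."""
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    d = z.shape[1]
    return float(mu @ mu + np.sum((cov - np.eye(d)) ** 2))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 8))
collapsed = np.ones((512, 8))
```

compare this to the usual JEPA recipe of EMAs plus multi-term variance/covariance/invariance losses: one additive term, one weight to tune.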
/
WAM augments DreamerV2 with a 3-layer MLP inverse dynamics head that predicts actions from consecutive encoder embeddings ($[e_t; e_{t+1}]$). this cascading effect propagates action-aware structure from the encoder through the posterior, prior, and into imagined rollouts, even though the world model itself is unchanged. the result: BC success improves from 45.8% to 61.7% and PPO fine-tuning from 79.8% to 92.8% across 8 CALVIN tasks, with 8.7x fewer world model training steps (230K vs 2M). the inverse dynamics head is discarded at inference – it only affects training. this is a minimal intervention (3 MLP layers, no architecture changes) that improves both representation quality and data efficiency. connects to the broader pattern of auxiliary training objectives shaping world model representations without changing the core model.
/
GlowQ computes one shared right factor $B_{\text{shared}} X$ per input-sharing group (e.g., Q/K/V projections all share the same input hidden state), then reuses it via module-specific left factors $A_i R$. this cuts high-precision matmuls roughly in half for standard transformers. a QR-reduced randomized SVD with covariance alignment ensures the shared subspace prioritizes frequently-used activation directions. the selective variant (GlowQ-S) activates only high-payoff groups, achieving 37.4% throughput improvement while losing only 0.2 pp accuracy. directly applicable to edge LLM deployment where every matmul matters.
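the sharing trick in miniature (any alignment rotation $R$ is folded into the per-module left factors here; shapes are illustrative):

```python
import numpy as np

def shared_factor_forward(X, B_shared, A_list):
    """Compute the expensive right factor B_shared @ X once per
    input-sharing group, then reuse it with cheap module-specific left
    factors (e.g. Q/K/V projections all read the same hidden state X)."""
    R = B_shared @ X                        # one high-precision matmul
    return [A_i @ R for A_i in A_list]      # small per-module matmuls
```

with three modules in the group, three full-size high-precision matmuls collapse into one shared matmul plus three low-rank ones, which is where the roughly-halved matmul count comes from.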