2026-03-29

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai et al.

world-models video-generation memory

problem

video world models treat environments as static canvases. when dynamic subjects (people, animals, vehicles) move out of the camera frame and re-emerge, current methods produce frozen, distorted, or vanishing subjects. the core issue: most video diffusion models have no persistent memory of object identity across occlusion events. they can generate plausible frames locally but fail to maintain object consistency when subjects leave and re-enter the field of view.

prior approaches to video consistency:

  • image-level consistency (ID-based methods): inject identity features at inference time, but only work for faces and don’t handle full-body or non-human subjects
  • temporal attention in transformers: limited by context window length. once a subject leaves the frame, attention weights decay and the model forgets
  • recurrent memory in video models: some methods use LSTM/GRU states, but these compress all information into a fixed-size vector, losing fine-grained identity details
  • segmentation-guided generation: track objects via segmentation masks, but require pretrained segmentation models and fail on occlusion

architecture

flowchart TD
    frame[video frame] --> DM[diffusion model]
    frame --> SA[static archivist memory]
    frame --> DT[dynamic tracker memory]
    SA --> BG[background scene tokens]
    DT --> SUB[per-subject identity tokens]
    BG --> HyDRA[HyDRA relevance retrieval]
    SUB --> HyDRA
    HyDRA --> DM
    DM --> out[generated next frame]
    out --> SA
    out --> DT
    
    style HyDRA fill:#c4b8a6,color:#fff
    style DM fill:#b09a84,color:#fff

Hybrid Memory paradigm: the model maintains two separate memory banks with distinct roles:

  1. static archivist memory: compresses and stores background scene structure (walls, floors, furniture). updated slowly, provides spatial grounding.
  2. dynamic tracker memory: maintains per-subject identity tokens that persist across occlusion. updated every frame the subject is visible, retrieved when the subject re-enters.
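the two-bank split can be sketched as a pair of stores with different update rates. this is a minimal sketch, not the paper’s implementation: the EMA update rules, token counts, and `rate` values are assumptions for illustration.

```python
import numpy as np

class StaticArchivist:
    """Slowly-updated background memory: EMA over scene tokens (rates are hypothetical)."""
    def __init__(self, num_tokens, dim, rate=0.05):
        self.tokens = np.zeros((num_tokens, dim))
        self.rate = rate  # small rate -> slow drift, stable spatial grounding

    def update(self, scene_feats):
        self.tokens = (1 - self.rate) * self.tokens + self.rate * scene_feats

class DynamicTracker:
    """Per-subject identity tokens, keyed by subject id; they persist through occlusion."""
    def __init__(self, rate=0.5):
        self.tokens = {}
        self.rate = rate  # large rate -> fast update whenever the subject is visible

    def update(self, subject_id, feat):
        if subject_id not in self.tokens:
            self.tokens[subject_id] = feat.copy()
        else:
            self.tokens[subject_id] = (1 - self.rate) * self.tokens[subject_id] + self.rate * feat

    def retrieve(self, subject_id):
        # returns the stored identity even if the subject is currently off-frame
        return self.tokens.get(subject_id)
```

the key behavioral difference: the archivist is only ever nudged toward new observations, while the tracker writes aggressively per visible frame and can be read back after arbitrarily long absences.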

HyDRA (Hybrid Dynamic Retriever Architecture): the memory module that compresses scene state into memory tokens and retrieves them via spatiotemporal relevance-driven retrieval:

  • stores scene state as a set of memory tokens $\{\mathbf{m}_1, \mathbf{m}_2, \dots, \mathbf{m}_K\}$
  • each new frame computes relevance scores via cross-attention over all tokens: $s_i = \text{softmax}_i\left(\mathbf{q}_t^\top \mathbf{m}_i / \sqrt{d}\right)$
  • the top-$k$ most relevant tokens are selected for selective attention to motion cues
  • memory tokens are updated via a gated merge: $\mathbf{m}_i' = \mathbf{m}_i + \alpha \cdot g_i \cdot \Delta\mathbf{m}_i$, where $g_i$ is a learned gate
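the retrieval-and-update step above can be sketched in numpy. this is a toy reconstruction from the listed equations only; the real gates $g_i$ and deltas $\Delta\mathbf{m}_i$ are learned, whereas here they are plain inputs, and `alpha` / `k` values are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hydra_step(memory, query, delta, gates, alpha=0.1, k=4):
    """One relevance-retrieval + gated-merge step.
    memory: (K, d) tokens, query: (d,) frame query,
    delta: (K, d) proposed updates, gates: (K,) learned gate values."""
    d = memory.shape[1]
    # relevance: scaled dot-product of the frame query against every memory token
    scores = softmax(memory @ query / np.sqrt(d))        # s_i over all K tokens
    topk = np.argsort(scores)[-k:]                       # indices of the k most relevant tokens
    retrieved = memory[topk]                             # tokens fed to the diffusion model
    # gated merge: m_i' = m_i + alpha * g_i * delta_m_i
    updated = memory + alpha * gates[:, None] * delta
    return retrieved, updated
```

note that the merge touches all tokens (scaled by their gates), while only the top-$k$ are passed to generation; a gate of zero leaves a token untouched.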

HM-World benchmark dataset: new benchmark designed to evaluate dynamic subject consistency:

  • 59K video clips across 17 scenes
  • 49 unique subjects (humans, animals, vehicles)
  • decoupled camera trajectories and subject trajectories (subjects move independently of camera)
  • designed exit-entry events where subjects leave and re-enter the frame

training

  • based on a video diffusion model backbone (details in paper)
  • trained on existing video datasets augmented with the HM-World data
  • the hybrid memory module is trained jointly with the generation model
  • training uses standard diffusion training with added memory consistency losses
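the objective shape implied above can be sketched as a standard diffusion loss plus a consistency term on the memory tokens. the paper does not give the exact losses here, so the MSE forms and the `lam` weight below are assumptions for illustration.

```python
import numpy as np

def diffusion_loss(eps_pred, eps_true):
    # standard epsilon-prediction MSE used by most video diffusion backbones
    return np.mean((eps_pred - eps_true) ** 2)

def memory_consistency_loss(tokens_before, tokens_after):
    # hypothetical consistency term: a subject's identity tokens should
    # change little across an occlusion (exit -> re-entry)
    return np.mean((tokens_before - tokens_after) ** 2)

def total_loss(eps_pred, eps_true, tokens_before, tokens_after, lam=0.1):
    return diffusion_loss(eps_pred, eps_true) + lam * memory_consistency_loss(tokens_before, tokens_after)
```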

evaluation

HM-World benchmark:

  • significantly outperforms SOTA in dynamic subject consistency
  • improved overall generation quality compared to baselines
  • specific metrics include subject identity consistency (face/body matching), temporal coherence, and FVD (Fréchet Video Distance)
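the identity-consistency metric is presumably embedding-based; a plausible minimal version (an assumption, not the paper’s exact protocol) compares a subject’s embedding before exit and after re-entry with cosine similarity:

```python
import numpy as np

def identity_consistency(emb_before, emb_after, eps=1e-8):
    """Cosine similarity between subject embeddings pre-exit and post-re-entry.
    1.0 = identical identity, ~0 = unrelated. Embedding model is unspecified here."""
    a = emb_before / (np.linalg.norm(emb_before) + eps)
    b = emb_after / (np.linalg.norm(emb_after) + eps)
    return float(a @ b)
```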

qualitative results:

  • prior methods: subjects freeze, distort, or change identity when re-entering frame
  • HyDRA: subjects maintain identity, pose, and appearance across occlusion events
  • background remains stable throughout (benefit of separate static memory)

comparison to baselines: outperforms standard video diffusion models, memory-augmented variants, and segmentation-guided approaches on the occlusion consistency metric.

reproduction guide

  1. no public code repo available yet; check the paper’s project page for updates
  2. the HM-World dataset construction is described in detail in the paper, so you could build a similar benchmark
  3. key ingredients: a base video diffusion model + the HyDRA memory module. the memory module can likely be plugged into existing video diffusion architectures
  4. for a minimal implementation: start with a short video diffusion model, add a small key-value memory store, and train on videos with explicit occlusion events
  5. the decoupled camera/subject motion in HM-World is important: standard video datasets have a camera-follows-subject bias that doesn’t stress-test occlusion handling
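step 4 of the guide can be made concrete as a tiny conditioning loop: write a subject’s identity feature to a key-value store while it is visible, and fall back to the stored value while it is occluded. everything here (the single `"subject"` key, zero-vector fallback) is a simplification for a minimal experiment.

```python
import numpy as np

def run_episode(frames_visible, feats, memory=None):
    """Minimal occlusion-aware conditioning loop.
    frames_visible: list[bool], feats: list of (d,) features per frame."""
    memory = {} if memory is None else memory
    conditioning = []
    for visible, feat in zip(frames_visible, feats):
        if visible:
            memory["subject"] = feat            # write while in view
            conditioning.append(feat)
        else:
            # out of view: condition the generator on the last stored identity
            conditioning.append(memory.get("subject", np.zeros_like(feat)))
    return conditioning, memory
```

training such a loop end-to-end requires videos with explicit exit/re-entry events, which is exactly the bias HM-World is built to remove.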

notes

the occlusion problem in video world models is underserved and this paper addresses it head-on. the dual-role memory (archivist for static + tracker for dynamic) is a clean conceptual split that makes intuitive sense.

relevant to robot world models where objects frequently leave camera view during manipulation. a robot that forgets what an occluded object looks like will fail at long-horizon tasks. the tracker memory idea could be adapted for robotic manipulation: maintain per-object state tokens that persist when objects are out of view.

the HM-World benchmark is a useful contribution on its own. most video generation benchmarks don’t specifically test occlusion handling, so this fills a gap.

open questions:

  • how does the memory scale with many dynamic subjects? is there a combinatorial explosion?
  • can the tracker memory handle subjects that change appearance (e.g., a person taking off a jacket)?
  • what’s the inference cost of the memory retrieval? does it add significant latency?
  • can this be combined with a world model for robotics where object state (position, velocity) is tracked alongside visual identity?