2026-04-05
Bridging Large-Model Reasoning and Real-Time Control via Agentic Fast-Slow Planning
Jiayi Chen, Shuai Wang, Guangxu Zhu, Chengzhong Xu
problem
Large foundation models bring powerful semantic reasoning to autonomous driving, but mapping high-level intent to reliable real-time control remains unsolved. The two dominant approaches each have fundamental limitations:
- Direct LLM trajectory generation (Trajectory-LLM, DriveVLM): outputs are brittle, hard to verify, and struggle with real-time control rates (~10 Hz).
- LLM-to-MPC objective tuning (LanguageMPC, MPCxLLM, System 1.x, LVM-MPC): mixes slow deliberation with fast control, blurring the interface between reasoning and actuation; LLMs are reduced to one-shot hyperparameter updates.
Both routes ignore scale-aware decoupling: perception, reasoning, planning, and control operate at fundamentally different timescales, and blurring them underutilizes learning–optimization complementarity.
The key gap is the LLM-to-planner interface: classical A* is well-studied at its output (to MPC), but there is no principled way to inject LLM-derived symbolic directives into search-based planners without brittle hard constraints or loss of geometric feasibility.
architecture
AFSP is a three-layer hierarchical framework inspired by dual-process cognition:
Layer 1: Perception2Decision (Edge + Cloud)
VLM-based Topology Detector (on-vehicle, edge): Compresses front-camera RGB into an ego-centric topology graph $G = \{z_j\}$ where each node $z_j = (b_j, d_j, \phi_j, t_j)$ stores bounding box, ego-polar distance, orientation, and semantic class. Ego-polar coordinates are obtained via rotation: $(d_j, \phi_j) = R(-\theta_{\text{ego}})(X_j - X_{\text{ego}}, Y_j - Y_{\text{ego}})$. Distances quantized to $0.1$ m, orientations to $0.5°$. Built on InternVL-2B (2B params).
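The per-node distance/orientation encoding can be sketched as follows; the helper name and the exact rounding arithmetic are assumptions, following the rotation formula and the 0.1 m / 0.5° quantization above.

```python
import math

def ego_polar(x_obj, y_obj, x_ego, y_ego, theta_ego):
    """Hypothetical helper: rotate a world-frame offset into the ego frame,
    then quantize distance to 0.1 m bins and orientation to 0.5 deg bins."""
    dx, dy = x_obj - x_ego, y_obj - y_ego
    # Rotation by -theta_ego maps the world-frame offset into the ego frame.
    ex = math.cos(-theta_ego) * dx - math.sin(-theta_ego) * dy
    ey = math.sin(-theta_ego) * dx + math.cos(-theta_ego) * dy
    d = round(math.hypot(ex, ey), 1)                        # 0.1 m bins
    phi = round(math.degrees(math.atan2(ey, ex)) * 2) / 2   # 0.5 deg bins
    return d, phi
```

Quantizing at the edge keeps the serialized graph compact, which is what makes the topology-graph-over-the-network split in Layer 1 cheaper than streaming raw images.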
LLM-based Decision Maker (cloud): Consumes serialized topology graph, outputs a compact driving plan: short sequence of discrete directives $\mathbf{A} = (a_1, \ldots, a_T)$ with $a_t \in \{\text{left}, \text{right}, \text{keep}\}$ plus a driving style mapped to reference velocity. Uses structured prompting with role description, in-context examples, and chain-of-thought reasoning that enumerates obstacles, assigns risk levels, and derives safe directives. Backend: DeepSeek-V3 via API. Average inference latency: 4.13 s (vs. 10.24 s for direct VLM-based reasoning).
Layer 2: Decision2Trajectory (Planner)
Semantic-Guided A*: Extends classical A* by augmenting state with $(m_{\text{prev}}, k)$ where $m_{\text{prev}} \in \{F, FL, FR\}$ is the previous move label and $k$ tracks progress through the LLM directive sequence. Each transition is classified via translation logic $\Phi$ into one of four alignment categories:
| Category | Semantic cost |
|---|---|
| Correct (realizes $a_k$) | $C_{\text{CORR}} < 0$ (reward) |
| Delay (doesn’t realize $a_k$) | $C_{\text{DELAY}} \geq 0$ (mild penalty) |
| Wrong (contradicts $a_k$) | $C_{\text{WRONG}} \gg 0$ (strong penalty) |
| Overact (repeats $a_k$) | $C_{\text{OVER}} \geq 0$ (mild penalty) |
Accumulated cost update: \(g(s') = g(s) + c_{\text{geom}}(s, s') + c_{\text{sem}}(m_{\text{prev}}, m_{\text{curr}}, a_k)\)
Heuristic $h$ retains classical geometric potential field. A valid solution requires $k = T$ at the goal, ensuring all directives are completed. State space remains polynomial.
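A minimal sketch of the per-transition semantic term, using the default costs from the configuration notes. It assumes a simplified translation rule (F/FL/FR realize keep/left/right, and lateral moves after all directives are consumed count as overacting); the paper's translation logic $\Phi$ may be richer.

```python
# Default semantic costs [C_CORR, C_DELAY, C_WRONG, C_OVER] from the paper.
C_CORR, C_DELAY, C_WRONG, C_OVER = -5.0, 1.0, 5.0, 0.8

def semantic_cost(move, directives, k):
    """Return (c_sem, new_k) for taking `move` while directive a_k is pending.

    Simplified stand-in for the translation logic Phi: which move realizes
    which directive is an assumption, not the paper's exact rule.
    """
    realizes = {"left": "FL", "right": "FR", "keep": "F"}
    if k >= len(directives):
        # All directives consumed: a further lateral move is an overaction.
        return (C_OVER if move != "F" else 0.0, k)
    a_k = directives[k]
    if move == realizes[a_k]:
        return (C_CORR, k + 1)          # Correct: realizes a_k, advance k
    if move == "F" and a_k != "keep":
        return (C_DELAY, k)             # Delay: postpones a_k
    return (C_WRONG, k)                 # Wrong: contradicts a_k
```

Because these are soft costs added to $g(s')$ rather than hard constraints, the search can still take a "Wrong" or "Delay" transition when geometry or safety demands it, which is the recovery property emphasized in the notes.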
Agentic Refinement Module: A closed-loop agent that automates hyperparameter tuning. Uses three tools: (1) SelectRefHyperparams — warm-starts from cloud memory of similar past scenes; (2) AStarPathGenerate — runs semantic-guided A* and reports structured metrics (trial index, semantic-cost trigger location, distance to obstacle); (3) SaveScene — logs successful $(s, g, \theta^*)$ triplets for continual learning. Implemented with DeepSeek-V3 + LangChain. Default initial config: $[-5, 1, 5, 0.8]$.
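The tool flow of the refinement agent (warm-start → plan → adjust → save) can be sketched as a closed loop. This is a self-contained toy: the planner and the LLM-driven adjustment are stand-ins, and only the loop structure mirrors the paper's three tools.

```python
DEFAULT_THETA = [-5.0, 1.0, 5.0, 0.8]   # [C_CORR, C_DELAY, C_WRONG, C_OVER]

def toy_planner(theta):
    """Stand-in for AStarPathGenerate: 'succeeds' once C_WRONG is large
    enough to keep the path off intent-contradicting moves."""
    feasible = theta[2] >= 8.0
    return {"feasible": feasible, "trigger": None if feasible else "C_WRONG"}

def toy_adjust(theta, metrics):
    """Stand-in for the LLM proposal: double the cost that triggered failure."""
    if metrics["trigger"] == "C_WRONG":
        theta = theta[:2] + [theta[2] * 2.0] + theta[3:]
    return theta

def refine(memory, scene_key, max_retries=5):
    theta = memory.get(scene_key, DEFAULT_THETA)   # SelectRefHyperparams
    for _ in range(max_retries):
        metrics = toy_planner(theta)               # AStarPathGenerate
        if metrics["feasible"]:
            memory[scene_key] = theta              # SaveScene
            return theta
        theta = toy_adjust(theta, metrics)
    return theta
```

The memory dict plays the role of the cloud store of $(s, g, \theta^*)$ triplets: the next similar scene warm-starts from the saved $\theta^*$ instead of the default config.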
Layer 3: Cloud-Guided MPC (Control)
Switching-based MPC with binary indicator $z_t \in \{0, 1\}$: \(s^*_{t+h} = (1 - z_t)\, \tau^{\text{local}}_{t+h} + z_t\, \tau^{\text{cloud}}_{t+h}, \quad h \in \mathcal{H}\)
Dynamics: $s_{t+1} = A_t s_t + B_t u_t + c_t$ with safety constraint $\text{dist}(G(s_t), O_m) \geq d_0, \; \forall m$.
Stage cost: $\sum_{h=0}^{H} \lVert s_{t+h} - s^*_{t+h} \rVert^2 + \lambda \sum_{h=0}^{H-1} \lVert u_{t+h} \rVert^2$.
Prediction horizon $H = 15$, time step $\Delta t = 0.2$ s, control rate 10 Hz. Uses the RDA MPC controller with fixed weights $w_s = 0.37$, $w_u = 0.2$.
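A minimal numeric sketch of the switching reference and stage cost, assuming per-step switching over the horizon; array shapes, names, and `lam` (standing in for $\lambda$) are illustrative, not the RDA controller's API.

```python
import numpy as np

def stage_cost(s, u, tau_local, tau_cloud, z, lam=0.2):
    """Evaluate the switching reference s* and the stage cost.

    s: (H+1, dim) predicted states; u: (H, dim) controls;
    z: (H+1,) binary switch per horizon step (1 -> cloud trajectory).
    """
    s_ref = (1 - z)[:, None] * tau_local + z[:, None] * tau_cloud
    tracking = float(np.sum((s - s_ref) ** 2))   # sum over h = 0..H
    effort = lam * float(np.sum(u ** 2))         # sum over h = 0..H-1
    return tracking + effort
```

When the cloud plan is stale or unavailable, setting $z_t = 0$ falls back to the local trajectory without changing the controller structure, which is the point of the switching formulation.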
training
VLM Topology Detector
- Base model: InternVL-2B (2B parameters)
- Training data: 50,000 CARLA frames (49 object categories) from Town2–7 and Town10HD, filtered for informative frames out of ~200,000 collected RGB images; 1,000-frame test set drawn by uniform sampling. Annotations include bounding boxes + traffic topology graphs serialized as QA pairs.
- Fine-tuning strategy: Two-stage progressive adaptation
- Stage 1: Freeze language backbone, fine-tune vision encoder + projection layers (for stable grounding)
- Stage 2: Unfreeze all parameters, joint fine-tuning
- Training objective: Next-token prediction: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid I, y_{<t})$
- Hardware: 4× A100 GPUs
- Hyperparameters: 12 epochs, learning rate $8 \times 10^{-5}$, best checkpoint selected
- Key results (Two-Stage vs alternatives):
- BBox IoU: 93.0% (vs. 92.8% ViT-only, 93.5% Full-Param)
- Category error: 0.04% (best, vs. 0.13%/0.08%)
- Distance error: 1.31% (best, vs. 1.34%/1.41%)
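The two-stage freeze schedule can be sketched as a parameter-name → trainable map. In a real run this would toggle `requires_grad` on InternVL-2B parameters; the module-name prefixes here are illustrative, not InternVL's actual attribute names.

```python
def trainable_mask(param_names, stage):
    """Stage 1: only vision encoder + projection layers train (stable
    grounding); stage 2: all parameters train jointly."""
    stage1_prefixes = ("vision_encoder.", "projector.")
    return {
        name: stage == 2 or name.startswith(stage1_prefixes)
        for name in param_names
    }
```

The reported ablation motivates the schedule: two-stage training beats ViT-only on category/distance error while avoiding the IoU-only edge of full-parameter tuning at higher cost.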
LLM Decision Maker
- No training required; uses DeepSeek-V3 via API with structured prompting (role + in-context examples + CoT)
- Evaluation: 200 frames → top 30% = 48 decision-worthy scenarios; average match score with VLM = 0.73; latency 4.13 s vs. 10.24 s for Qwen2-VL-chat
Agentic Refinement
- No model training; prompt-driven closed-loop refinement via DeepSeek-V3 + LangChain
- Cloud memory accumulates $(s, g, \theta^*)$ triplets over time
evaluation
All experiments in CARLA simulator, target speed 4.2 m/s, 7 obstacles per scenario (3 vehicles + 4 static objects), vehicle inflation factor 1.1, grid resolution 3 m × 3 m.
Experiment 1: Perception2Decision
VLM topology detection achieves 93.0% IoU, 0.04% category error on the 1,000-frame test set. LLM Decision Maker achieves 0.73 average match score with VLM-based decisions at 2.5× lower latency.
Experiment 2: Semantic-Guided A* Robustness
Tested under three perturbation settings: Shift 1 (original map), Shift 2 (+0.5 m left), Shift 3 (+0.5 m left, +1.0 m up). Vanilla A* drifts under perturbation; semantic-guided A* preserves left/keep/right patterns across all shifts except G1 under Shift 3 (accumulated drift). G2 and G3 remain fully intent-consistent.
Experiment 3: Full AFSP in CARLA (3 scenarios, 10 runs each)
| Scenario | Method | Finish Time (s) | Avg Lat Dev (m) | Max Lat Dev (m) | Speed Var (m/s) |
|---|---|---|---|---|---|
| 1 (nominal) | MPC | 17.04 | 2.00 | 6.30 | 3.01 |
| 1 (nominal) | A*-MPC | 17.26 | 1.92 | 6.22 | 2.13 |
| 1 (nominal) | AFSP | 14.85 | 1.21 | 3.72 | 1.56 |
| 2 (2m x-shift) | MPC | 16.80 | 1.89 | 6.26 | 2.81 |
| 2 (2m x-shift) | A*-MPC | 16.63 | 0.98 | 4.17 | 1.55 |
| 2 (2m x-shift) | AFSP | 15.02 | 1.01 | 3.23 | 1.30 |
| 3 (1.5m x + 1m y) | MPC | 16.90 | 1.77 | 6.25 | 2.70 |
| 3 (1.5m x + 1m y) | A*-MPC | 16.62 | 1.70 | 5.92 | 2.46 |
| 3 (1.5m x + 1m y) | AFSP | 14.73 | 1.02 | 3.42 | 1.48 |
Summary improvements (AFSP averages):
- Finish time: ~12% reduction vs. MPC, ~11% vs. A*-MPC
- Max lateral deviation: ~45% reduction vs. MPC, ~35% vs. A*-MPC
- 100% success rate across all methods and scenarios
reproduction guide
Prerequisites
- CARLA 0.9.13+ simulator (Town2–7, Town10HD maps)
- Python 3.10+, PyTorch 2.x
- 4× A100 GPUs (for VLM fine-tuning)
- DeepSeek-V3 API access (for LLM Decision Maker and Agentic Refinement)
- LangChain (for agentic refinement loop)
Step 1: VLM Topology Detector Training
# Clone repo
git clone https://github.com/cjychenjiayi/icra2026_AFSP
cd icra2026_AFSP
# Download/preprocess CARLA dataset (50K train, 1K test frames)
python data/collect_carla.py --towns 2,3,4,5,6,7,10HD --num-frames 200000
python data/filter_curate.py --keep 50000 --test 1000
# Fine-tune InternVL-2B with two-stage strategy
python train_vlm.py \
--base-model internvl/internvl2-2b \
--train-data data/train_topology_qa.json \
--test-data data/test_topology_qa.json \
--stage1-epochs 6 --stage2-epochs 6 \
--lr 8e-5 --gpus 4 --batch-size 32
Step 2: LLM Decision Maker Setup
# Configure structured prompt with role, in-context examples, and CoT
# DeepSeek-V3 via API; input is serialized topology graph
python llm_decision.py --model deepseek-v3 --api-key $DEEPSEEK_KEY
Step 3: Semantic-Guided A* Planning
# Run with semantic guidance from LLM directives
python plan_semantic_astar.py \
--map scenario_map.pgm \
--directives directives.json \
--costs -5,1,5,0.8 # C_CORR, C_DELAY, C_WRONG, C_OVER
Step 4: Agentic Refinement Module
# Closed-loop refinement with LangChain + DeepSeek-V3
python agentic_refine.py \
--scene scene_config.yaml \
--max-retries 5 \
--memory-dir cloud_memory/
Step 5: Cloud-Guided MPC Execution
# Run full AFSP stack in CARLA at 10 Hz
python run_afsp.py \
--scenario carla_scenario \
--target-speed 4.2 \
--horizon 15 --dt 0.2 \
--ws 0.37 --wu 0.2 \
--inflation 1.1
Key Configuration Notes
- VLM runs on-vehicle (edge); LLM and refinement module run in cloud
- Default semantic costs: $C_{\text{CORR}} = -5$, $C_{\text{DELAY}} = 1$, $C_{\text{WRONG}} = 5$, $C_{\text{OVER}} = 0.8$
- Grid resolution: 3 m × 3 m, lane width ~3 m
- MPC uses the RDA controller (Han et al., 2023) with fixed weights
- Match score evaluation requires Hungarian assignment between VLM and LLM action sequences
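A toy sketch of that match-score evaluation: the two action sequences are aligned with an optimal one-to-one assignment (Hungarian in the paper; brute-force permutations here, fine for short sequences) and scored as the fraction of aligned actions that agree. The padding and scoring rule are assumptions, not the paper's exact metric.

```python
from itertools import permutations

def match_score(llm_actions, vlm_actions):
    """Fraction of actions agreeing under the best one-to-one alignment."""
    n = max(len(llm_actions), len(vlm_actions))
    a = list(llm_actions) + ["<pad>"] * (n - len(llm_actions))
    b = list(vlm_actions) + ["<pad>"] * (n - len(vlm_actions))
    best = 0
    for perm in permutations(range(n)):
        agree = sum(a[i] == b[perm[i]] and a[i] != "<pad>" for i in range(n))
        best = max(best, agree)
    return best / n
```

For production-scale sequences one would use `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.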
notes
- Novelty claim: first principled extension of A* incorporating language-derived semantics as soft costs with polynomial state-space complexity, establishing a formal interface between symbolic reasoning and graph search.
- Design philosophy: scale-aware decoupling — each layer operates at its natural rate with explicit interpretable interfaces (topology graphs → symbolic directives → reference trajectories → control actions).
- Edge–cloud split: VLM topology detector runs on-vehicle (compressed representation); LLM decision maker + agentic refinement run in cloud. This reduces bandwidth (topology graph vs. raw images) while keeping perception latency local. LLM achieves 0.73 match score with VLM at 2.5× lower latency.
- Soft costs vs. hard constraints: encoding LLM directives as soft costs (not hard constraints) is critical — it biases search toward intent-consistent paths while allowing recovery when directives conflict with geometry or safety.
- Comparison landscape: LanguageMPC/MPCxLLM reduce LLMs to one-shot updates; System 1.x gates deliberation but lacks principled planner interface; DriveVLM couples perception with reasoning but struggles with control rates; EPSILON handles corridors but depends on hand-crafted costs. AFSP is the first to combine all three: LLM-informed planning, agentic hyperparameter tuning, and switching MPC.
- Limitations: evaluation is simulation-only (CARLA) with relatively simple scenarios (7 obstacles, straight corridors); no real-world validation. LLM decision quality (0.73 match score) suggests room for improvement. The framework assumes reliable VLM topology detection; robustness to perception failures is not evaluated.
- Future directions stated: adaptive edge–cloud collaboration, verification of symbolic directives for practical deployment.
- Also note: two authors are listed as corresponding (Shuai Wang, Guangxu Zhu); Chengzhong Xu is the fourth author.