2026-04-05
Bridging Large-Model Reasoning and Real-Time Control via Agentic Fast-Slow Planning
Jiayi Chen, Shuai Wang, Guangxu Zhu, Chengzhong Xu
problem
Large foundation models bring powerful semantic reasoning to autonomous driving, but mapping high-level intent to reliable real-time control remains unsolved. The two dominant approaches each have fundamental limitations:
- Direct LLM trajectory generation (Trajectory-LLM, DriveVLM): outputs are brittle, hard to verify, and struggle with real-time control rates (~10 Hz).
- LLM-to-MPC objective tuning (LanguageMPC, MPCxLLM, System 1.x, LVM-MPC): mixes slow deliberation with fast control, blurring the interface between reasoning and actuation; LLMs are reduced to one-shot hyperparameter updates.
Both routes ignore scale-aware decoupling: perception, reasoning, planning, and control operate at fundamentally different timescales, and blurring them underutilizes learning–optimization complementarity.
The key gap is the LLM-to-planner interface: classical A* is well-studied at its output (to MPC), but there is no principled way to inject LLM-derived symbolic directives into search-based planners without brittle hard constraints or loss of geometric feasibility.
architecture
AFSP is a three-layer hierarchical framework inspired by dual-process cognition:
Layer 1: Perception2Decision (Edge + Cloud)
VLM-based Topology Detector (on-vehicle, edge): Compresses front-camera RGB into an ego-centric topology graph $G = \{z_j\}$ where each node $z_j = (b_j, d_j, \phi_j, t_j)$ stores bounding box, ego-polar distance, orientation, and semantic class. Ego-polar coordinates are obtained via rotation: $(d_j, \phi_j) = R(-\theta_{\text{ego}})(X_j - X_{\text{ego}}, Y_j - Y_{\text{ego}})$. Distances quantized to $0.1$ m, orientations to $0.5°$. Built on InternVL-2B (2B params).
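The per-node distance/orientation encoding can be sketched as follows; the helper name and the exact rounding arithmetic are assumptions, following the rotation formula and the 0.1 m / 0.5° quantization above.

```python
import math

def ego_polar(x_obj, y_obj, x_ego, y_ego, theta_ego):
    """Hypothetical helper: rotate a world-frame offset into the ego frame,
    then quantize distance to 0.1 m bins and orientation to 0.5 deg bins."""
    dx, dy = x_obj - x_ego, y_obj - y_ego
    # Rotation by -theta_ego maps the world-frame offset into the ego frame.
    ex = math.cos(-theta_ego) * dx - math.sin(-theta_ego) * dy
    ey = math.sin(-theta_ego) * dx + math.cos(-theta_ego) * dy
    d = round(math.hypot(ex, ey), 1)                        # 0.1 m bins
    phi = round(math.degrees(math.atan2(ey, ex)) * 2) / 2   # 0.5 deg bins
    return d, phi
```

Quantizing at the edge keeps the serialized graph compact, which is what makes the topology-graph-over-the-network split in Layer 1 cheaper than streaming raw images.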
LLM-based Decision Maker (cloud): Consumes serialized topology graph, outputs a compact driving plan: short sequence of discrete directives $\mathbf{A} = (a_1, \ldots, a_T)$ with $a_t \in \{\text{left}, \text{right}, \text{keep}\}$ plus a driving style mapped to reference velocity. Uses structured prompting with role description, in-context examples, and chain-of-thought reasoning that enumerates obstacles, assigns risk levels, and derives safe directives. Backend: DeepSeek-V3 via API. Average inference latency: 4.13 s (vs. 10.24 s for direct VLM-based reasoning).
Layer 2: Decision2Trajectory (Planner)
Semantic-Guided A*: Extends classical A* by augmenting state with $(m_{\text{prev}}, k)$ where $m_{\text{prev}} \in \{F, FL, FR\}$ is the previous move label and $k$ tracks progress through the LLM directive sequence. Each transition is classified via translation logic $\Phi$ into one of four alignment categories:
| Category | Semantic cost |
|---|---|
| Correct (realizes $a_k$) | $C_{\text{CORR}} < 0$ (reward) |
| Delay (doesn’t realize $a_k$) | $C_{\text{DELAY}} \geq 0$ (mild penalty) |
| Wrong (contradicts $a_k$) | $C_{\text{WRONG}} \gg 0$ (strong penalty) |
| Overact (repeats $a_k$) | $C_{\text{OVER}} \geq 0$ (mild penalty) |
Accumulated cost update: \(g(s') = g(s) + c_{\text{geom}}(s, s') + c_{\text{sem}}(m_{\text{prev}}, m_{\text{curr}}, a_k)\)
Heuristic $h$ retains classical geometric potential field. A valid solution requires $k = T$ at the goal, ensuring all directives are completed. State space remains polynomial.
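A minimal sketch of the per-transition semantic term, using the default costs from the configuration notes. It assumes a simplified translation rule (F/FL/FR realize keep/left/right, and lateral moves after all directives are consumed count as overacting); the paper's translation logic $\Phi$ may be richer.

```python
# Default semantic costs [C_CORR, C_DELAY, C_WRONG, C_OVER] from the paper.
C_CORR, C_DELAY, C_WRONG, C_OVER = -5.0, 1.0, 5.0, 0.8

def semantic_cost(move, directives, k):
    """Return (c_sem, new_k) for taking `move` while directive a_k is pending.

    Simplified stand-in for the translation logic Phi: which move realizes
    which directive is an assumption, not the paper's exact rule.
    """
    realizes = {"left": "FL", "right": "FR", "keep": "F"}
    if k >= len(directives):
        # All directives consumed: a further lateral move is an overaction.
        return (C_OVER if move != "F" else 0.0, k)
    a_k = directives[k]
    if move == realizes[a_k]:
        return (C_CORR, k + 1)          # Correct: realizes a_k, advance k
    if move == "F" and a_k != "keep":
        return (C_DELAY, k)             # Delay: postpones a_k
    return (C_WRONG, k)                 # Wrong: contradicts a_k
```

Because these are soft costs added to $g(s')$ rather than hard constraints, the search can still take a "Wrong" or "Delay" transition when geometry or safety demands it, which is the recovery property emphasized in the notes.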
Agentic Refinement Module: A closed-loop agent that automates hyperparameter tuning. Uses three tools: (1) SelectRefHyperparams — warm-starts from cloud memory of similar past scenes; (2) AStarPathGenerate — runs semantic-guided A* and reports structured metrics (trial index, semantic-cost trigger location, distance to obstacle); (3) SaveScene — logs successful $(s, g, \theta^*)$ triplets for continual learning. Implemented with DeepSeek-V3 + LangChain. Default initial config: $[-5, 1, 5, 0.8]$.
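The tool flow of the refinement agent (warm-start → plan → adjust → save) can be sketched as a closed loop. This is a self-contained toy: the planner and the LLM-driven adjustment are stand-ins, and only the loop structure mirrors the paper's three tools.

```python
DEFAULT_THETA = [-5.0, 1.0, 5.0, 0.8]   # [C_CORR, C_DELAY, C_WRONG, C_OVER]

def toy_planner(theta):
    """Stand-in for AStarPathGenerate: 'succeeds' once C_WRONG is large
    enough to keep the path off intent-contradicting moves."""
    feasible = theta[2] >= 8.0
    return {"feasible": feasible, "trigger": None if feasible else "C_WRONG"}

def toy_adjust(theta, metrics):
    """Stand-in for the LLM proposal: double the cost that triggered failure."""
    if metrics["trigger"] == "C_WRONG":
        theta = theta[:2] + [theta[2] * 2.0] + theta[3:]
    return theta

def refine(memory, scene_key, max_retries=5):
    theta = memory.get(scene_key, DEFAULT_THETA)   # SelectRefHyperparams
    for _ in range(max_retries):
        metrics = toy_planner(theta)               # AStarPathGenerate
        if metrics["feasible"]:
            memory[scene_key] = theta              # SaveScene
            return theta
        theta = toy_adjust(theta, metrics)
    return theta
```

The memory dict plays the role of the cloud store of $(s, g, \theta^*)$ triplets: the next similar scene warm-starts from the saved $\theta^*$ instead of the default config.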
Layer 3: Cloud-Guided MPC (Control)
Switching-based MPC with binary indicator $z_t \in \{0, 1\}$: \(s^*_{t+h} = (1 - z_t)\, \tau^{\text{local}}_{t+h} + z_t\, \tau^{\text{cloud}}_{t+h}, \quad h \in \mathcal{H}\)
Dynamics: $s_{t+1} = A_t s_t + B_t u_t + c_t$ with safety constraint $\text{dist}(G(s_t), O_m) \geq d_0, \; \forall m$.
Stage cost: $\sum_{h=0}^{H} \lVert s_{t+h} - s^*_{t+h} \rVert^2 + \lambda \sum_{h=0}^{H-1} \lVert u_{t+h} \rVert^2$.
Prediction horizon $H = 15$, time step $\Delta t = 0.2$ s, control rate 10 Hz. Uses the RDA MPC controller with fixed weights $w_s = 0.37$, $w_u = 0.2$.
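A minimal numeric sketch of the switching reference and stage cost, assuming per-step switching over the horizon; array shapes, names, and `lam` (standing in for $\lambda$) are illustrative, not the RDA controller's API.

```python
import numpy as np

def stage_cost(s, u, tau_local, tau_cloud, z, lam=0.2):
    """Evaluate the switching reference s* and the stage cost.

    s: (H+1, dim) predicted states; u: (H, dim) controls;
    z: (H+1,) binary switch per horizon step (1 -> cloud trajectory).
    """
    s_ref = (1 - z)[:, None] * tau_local + z[:, None] * tau_cloud
    tracking = float(np.sum((s - s_ref) ** 2))   # sum over h = 0..H
    effort = lam * float(np.sum(u ** 2))         # sum over h = 0..H-1
    return tracking + effort
```

When the cloud plan is stale or unavailable, setting $z_t = 0$ falls back to the local trajectory without changing the controller structure, which is the point of the switching formulation.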
training
VLM Topology Detector
- Base model: InternVL-2B (2B parameters)
- Training data: 50,000 CARLA frames (49 object categories) from Town2–7 and Town10HD, filtered for informative frames out of ~200,000 collected RGB images; 1,000-frame test set drawn by uniform sampling. Annotations include bounding boxes + traffic topology graphs serialized as QA pairs.
- Fine-tuning strategy: Two-stage progressive adaptation
- Stage 1: Freeze language backbone, fine-tune vision encoder + projection layers (for stable grounding)
- Stage 2: Unfreeze all parameters, joint fine-tuning
- Training objective: Next-token prediction: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid I, y_{<t})$
- Hardware: 4× A100 GPUs
- Hyperparameters: 12 epochs, learning rate $8 \times 10^{-5}$, best checkpoint selected
- Key results (Two-Stage vs alternatives):
- BBox IoU: 93.0% (vs. 92.8% ViT-only, 93.5% Full-Param)
- Category error: 0.04% (best, vs. 0.13%/0.08%)
- Distance error: 1.31% (best, vs. 1.34%/1.41%)
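The two-stage freeze schedule can be sketched as a parameter-name → trainable map. In a real run this would toggle `requires_grad` on InternVL-2B parameters; the module-name prefixes here are illustrative, not InternVL's actual attribute names.

```python
def trainable_mask(param_names, stage):
    """Stage 1: only vision encoder + projection layers train (stable
    grounding); stage 2: all parameters train jointly."""
    stage1_prefixes = ("vision_encoder.", "projector.")
    return {
        name: stage == 2 or name.startswith(stage1_prefixes)
        for name in param_names
    }
```

The reported ablation motivates the schedule: two-stage training beats ViT-only on category/distance error while avoiding the IoU-only edge of full-parameter tuning at higher cost.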
LLM Decision Maker
- No training required; uses DeepSeek-V3 via API with structured prompting (role + in-context examples + CoT)
- Evaluation: 200 frames → top 30% = 48 decision-worthy scenarios; average match score with VLM = 0.73; latency 4.13 s vs. 10.24 s for Qwen2-VL-chat
Agentic Refinement
- No model training; prompt-driven closed-loop refinement via DeepSeek-V3 + LangChain
- Cloud memory accumulates $(s, g, \theta^*)$ triplets over time
evaluation
All experiments in CARLA simulator, target speed 4.2 m/s, 7 obstacles per scenario (3 vehicles + 4 static objects), vehicle inflation factor 1.1, grid resolution 3 m × 3 m.
Experiment 1: Perception2Decision
VLM topology detection achieves 93.0% IoU, 0.04% category error on the 1,000-frame test set. LLM Decision Maker achieves 0.73 average match score with VLM-based decisions at 2.5× lower latency.
Experiment 2: Semantic-Guided A* Robustness
Tested under three perturbation settings: Shift 1 (original map), Shift 2 (+0.5 m left), Shift 3 (+0.5 m left, +1.0 m up). Vanilla A* drifts under perturbation; semantic-guided A* preserves left/keep/right patterns across all shifts except G1 under Shift 3 (accumulated drift). G2 and G3 remain fully intent-consistent.
Experiment 3: Full AFSP in CARLA (3 scenarios, 10 runs each)
| Scenario | Method | Finish Time (s) | Avg Lat Dev (m) | Max Lat Dev (m) | Speed Var (m/s) |
|---|---|---|---|---|---|
| 1 (nominal) | MPC | 17.04 | 2.00 | 6.30 | 3.01 |
| 1 (nominal) | A*-MPC | 17.26 | 1.92 | 6.22 | 2.13 |
| 1 (nominal) | AFSP | 14.85 | 1.21 | 3.72 | 1.56 |
| 2 (2m x-shift) | MPC | 16.80 | 1.89 | 6.26 | 2.81 |
| 2 (2m x-shift) | A*-MPC | 16.63 | 0.98 | 4.17 | 1.55 |
| 2 (2m x-shift) | AFSP | 15.02 | 1.01 | 3.23 | 1.30 |
| 3 (1.5m x + 1m y) | MPC | 16.90 | 1.77 | 6.25 | 2.70 |
| 3 (1.5m x + 1m y) | A*-MPC | 16.62 | 1.70 | 5.92 | 2.46 |
| 3 (1.5m x + 1m y) | AFSP | 14.73 | 1.02 | 3.42 | 1.48 |
Summary improvements (AFSP averages):
- Finish time: ~12% reduction vs. MPC, ~11% vs. A*-MPC
- Max lateral deviation: ~45% reduction vs. MPC, ~35% vs. A*-MPC
- 100% success rate across all methods and scenarios
reproduction guide
Prerequisites
- CARLA 0.9.13+ simulator (Town2–7, Town10HD maps)
- Python 3.10+, PyTorch 2.x
- 4× A100 GPUs (for VLM fine-tuning)
- DeepSeek-V3 API access (for LLM Decision Maker and Agentic Refinement)
- LangChain (for agentic refinement loop)
Step 1: VLM Topology Detector Training
# Clone repo
git clone https://github.com/cjychenjiayi/icra2026_AFSP
cd icra2026_AFSP
# Download/preprocess CARLA dataset (50K train, 1K test frames)
python data/collect_carla.py --towns 2,3,4,5,6,7,10HD --num-frames 200000
python data/filter_curate.py --keep 50000 --test 1000
# Fine-tune InternVL-2B with two-stage strategy
python train_vlm.py \
--base-model internvl/internvl2-2b \
--train-data data/train_topology_qa.json \
--test-data data/test_topology_qa.json \
--stage1-epochs 6 --stage2-epochs 6 \
--lr 8e-5 --gpus 4 --batch-size 32
Step 2: LLM Decision Maker Setup
# Configure structured prompt with role, in-context examples, and CoT
# DeepSeek-V3 via API; input is serialized topology graph
python llm_decision.py --model deepseek-v3 --api-key $DEEPSEEK_KEY
Step 3: Semantic-Guided A* Planning
# Run with semantic guidance from LLM directives
python plan_semantic_astar.py \
--map scenario_map.pgm \
--directives directives.json \
--costs -5,1,5,0.8 # C_CORR, C_DELAY, C_WRONG, C_OVER
Step 4: Agentic Refinement Module
# Closed-loop refinement with LangChain + DeepSeek-V3
python agentic_refine.py \
--scene scene_config.yaml \
--max-retries 5 \
--memory-dir cloud_memory/
Step 5: Cloud-Guided MPC Execution
# Run full AFSP stack in CARLA at 10 Hz
python run_afsp.py \
--scenario carla_scenario \
--target-speed 4.2 \
--horizon 15 --dt 0.2 \
--ws 0.37 --wu 0.2 \
--inflation 1.1
Key Configuration Notes
- VLM runs on-vehicle (edge); LLM and refinement module run in cloud
- Default semantic costs: $C_{\text{CORR}} = -5$, $C_{\text{DELAY}} = 1$, $C_{\text{WRONG}} = 5$, $C_{\text{OVER}} = 0.8$
- Grid resolution: 3 m × 3 m, lane width ~3 m
- MPC uses the RDA controller (Han et al., 2023) with fixed weights
- Match score evaluation requires Hungarian assignment between VLM and LLM action sequences
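A toy sketch of that match-score evaluation: the two action sequences are aligned with an optimal one-to-one assignment (Hungarian in the paper; brute-force permutations here, fine for short sequences) and scored as the fraction of aligned actions that agree. The padding and scoring rule are assumptions, not the paper's exact metric.

```python
from itertools import permutations

def match_score(llm_actions, vlm_actions):
    """Fraction of actions agreeing under the best one-to-one alignment."""
    n = max(len(llm_actions), len(vlm_actions))
    a = list(llm_actions) + ["<pad>"] * (n - len(llm_actions))
    b = list(vlm_actions) + ["<pad>"] * (n - len(vlm_actions))
    best = 0
    for perm in permutations(range(n)):
        agree = sum(a[i] == b[perm[i]] and a[i] != "<pad>" for i in range(n))
        best = max(best, agree)
    return best / n
```

For production-scale sequences one would use `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.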
notes
- Novelty claim: first principled extension of A* incorporating language-derived semantics as soft costs with polynomial state-space complexity, establishing a formal interface between symbolic reasoning and graph search.
- Design philosophy: scale-aware decoupling — each layer operates at its natural rate with explicit interpretable interfaces (topology graphs → symbolic directives → reference trajectories → control actions).
- Edge–cloud split: VLM topology detector runs on-vehicle (compressed representation); LLM decision maker + agentic refinement run in cloud. This reduces bandwidth (topology graph vs. raw images) while keeping perception latency local. LLM achieves 0.73 match score with VLM at 2.5× lower latency.
- Soft costs vs. hard constraints: encoding LLM directives as soft costs (not hard constraints) is critical — it biases search toward intent-consistent paths while allowing recovery when directives conflict with geometry or safety.
- Comparison landscape: LanguageMPC/MPCxLLM reduce LLMs to one-shot updates; System 1.x gates deliberation but lacks principled planner interface; DriveVLM couples perception with reasoning but struggles with control rates; EPSILON handles corridors but depends on hand-crafted costs. AFSP is the first to combine all three: LLM-informed planning, agentic hyperparameter tuning, and switching MPC.
- Limitations: evaluation is simulation-only (CARLA) with relatively simple scenarios (7 obstacles, straight corridors); no real-world validation. LLM decision quality (0.73 match score) suggests room for improvement. The framework assumes reliable VLM topology detection; robustness to perception failures is not evaluated.
- Future directions stated: adaptive edge–cloud collaboration, verification of symbolic directives for practical deployment.
- Also note: two authors are listed as corresponding (Shuai Wang, Guangxu Zhu); Chengzhong Xu is the fourth author.