2026-03-29
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang et al.
problem
LLM-based embodied agents face a fundamental tension between reasoning quality and computational cost. Methods like SayCan (Ahn et al., 2022), Inner Monologue (Huang et al., 2022), Code as Policies (Liang et al., 2023), and ProgPrompt (Singh et al., 2023) invoke LLM reasoning at nearly every decision step, treating reasoning as a free resource. On real robots — especially resource-constrained embedded platforms — this assumption breaks down. Each LLM query adds $200\text{ms}$ to $5\text{s}$ of latency per invocation depending on model size and hardware, and drains battery on mobile platforms.
Existing approaches use fixed reasoning strategies: always-invoke, periodic invocation, or simple heuristic thresholds (e.g., invoke only when low-level policy confidence drops below $\tau$). These fixed strategies cannot adapt to per-step difficulty. A robot slicing bread needs heavy spatial reasoning; a robot walking to the kitchen does not. Uniform reasoning wastes compute on easy sub-steps and under-spends on hard ones.
The core question: can we learn when to invoke LLM reasoning, which reasoning role to use, and how much compute budget to allocate, as a function of the current state and task? This requires a policy that trades off task success against resource consumption — a natural fit for reinforcement learning.
architecture
RARRL (Resource-Aware Reasoning via Reinforcement Learning) is a hierarchical framework with two levels:
```mermaid
flowchart TD
  obs[observation s_t] --> Orch[orchestrator pi_phi]
  task[task instruction tau] --> Orch
  budget[resource budget b_t] --> Orch
  Orch -->|r_t = 0 skip| Exec[executor low-level skill]
  Orch -->|r_t = 1 reason| Role[reasoning role c_t]
  budget_m[compute budget m_t] --> Role
  Role --> LLM[LLM module]
  LLM --> Exec
  Exec --> action[action]
  action --> env[environment]
  env --> obs
  action --> latency[latency model ell c_t m_t]
  latency --> reward[reward R_task - lambda R_resource]
```
Orchestrator (high-level policy $\pi_\phi$): At each timestep $t$, the orchestrator observes a state tuple $o_t = (s_t, \tau, b_t, h_t)$ where:
- $s_t$ is the current visual/environment observation,
- $\tau$ is the natural language task instruction,
- $b_t$ is the remaining resource budget (time or energy),
- $h_t$ is the history of past orchestration decisions.
The orchestrator outputs an action $a_t = (r_t, c_t, m_t) \in \mathcal{A}$:
- $r_t \in \{0, 1\}$: binary decision to invoke LLM reasoning or skip directly to low-level action,
- $c_t \in \{1, \ldots, K\}$: selection of reasoning role (e.g., planner, verifier, error-recovery), where $K$ is the number of available reasoning modules,
- $m_t$: allocated compute budget for the selected reasoning role.
The orchestrator is a small neural network policy (order of magnitude smaller than the LLMs it controls) — intentionally lightweight so it can run continuously on embedded hardware.
Executor (low-level): When $r_t = 0$, the executor applies a pre-trained low-level skill policy $\pi_{\text{skill}}$ directly. When $r_t = 1$, the selected LLM reasoning module processes the observation and task context with budget $m_t$, producing a refined action or plan modification that is then fed to the executor.
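As a minimal sketch, one control step of this two-level hierarchy could look like the code below. All names here (`OrchestratorAction`, `control_step`, the toy `skill` and `roles` stand-ins) are my assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class OrchestratorAction:
    r: int    # 1 = invoke LLM reasoning, 0 = execute the low-level skill directly
    c: int    # index of the selected reasoning role (planner, verifier, ...)
    m: int    # compute budget allocated to that role

def control_step(obs, action, skill_policy, llm_modules):
    """One timestep: the orchestrator's action decides whether the executor
    acts on the raw observation or on an LLM-refined plan hint."""
    if action.r == 0:
        return skill_policy(obs, hint=None)          # skip reasoning entirely
    hint = llm_modules[action.c](obs, budget=action.m)  # run the selected role
    return skill_policy(obs, hint=hint)              # executor uses the refinement

# Toy stand-ins so the sketch runs end to end:
skill = lambda obs, hint: ("refined" if hint else "raw", obs)
roles = [lambda obs, budget: f"plan(budget={budget})"]
```

The point of the split is that `control_step` only pays for an LLM call when `action.r == 1`; everything else is a cheap skill-policy forward pass.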
Latency model: Each reasoning role $c_k$ has an empirically profiled latency function $\ell_k(m)$ mapping budget $m$ to expected wall-clock time. The total resource consumption over a trajectory is:
\[L = \sum_{t=1}^{T} r_t \cdot \ell_{c_t}(m_t)\]
This latency model is built by benchmarking each LLM reasoning module on the target hardware before RL training begins, capturing actual inference costs rather than proxy measures such as token count.
Reward function: The reward combines task success and resource efficiency:
\[\mathcal{R}(s_0, a_{0:T}) = \underbrace{\mathbb{1}[\text{goal achieved}]}_{\mathcal{R}_{\text{task}}} - \lambda \cdot \underbrace{\frac{1}{T} \sum_{t=0}^{T} r_t \cdot \ell_{c_t}(m_t)}_{\mathcal{R}_{\text{resource}}}\]where $\lambda$ is a trade-off hyperparameter controlling how strongly the policy is penalized for resource usage. This formulation directly incentivizes the orchestrator to learn a sparse reasoning pattern: invoke LLM reasoning only when the expected marginal gain in task success justifies the latency cost.
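Translated directly into code, the episode-level reward could be computed as below (a sketch; the paper does not publish its reward implementation, and `ell` here is any latency lookup function).

```python
def episode_reward(goal_achieved, decisions, ell, lam):
    """R = 1[goal achieved] - lam * (1/T) * sum_t r_t * ell_{c_t}(m_t).

    decisions: list of (r_t, c_t, m_t) over the episode;
    ell(role, budget) -> seconds; lam is the trade-off weight lambda."""
    T = len(decisions)
    r_task = 1.0 if goal_achieved else 0.0
    r_resource = sum(r * ell(c, m) for r, c, m in decisions) / T
    return r_task - lam * r_resource
```

With a large `lam`, a successful but reasoning-heavy episode can score worse than a successful sparse one, which is exactly the pressure that produces the sparse invocation pattern.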
training
Environment: ALFRED (AI2-THOR household tasks). Training uses the full ALFRED training split of $\sim$21K episodes across 6 task types (pick_and_place, pick_clean_then_place, pick_heat_then_place, pick_cool_then_place, pick_two_obj_and_place, look_at_obj) in 4 room types (kitchens, living rooms, bedrooms, bathrooms).
RL algorithm: The orchestrator policy $\pi_\phi$ is trained with PPO (Proximal Policy Optimization). The policy and value networks are small MLPs operating on the observation embedding space.
Training procedure:
- Pre-profile each LLM reasoning module on target hardware to build the latency model $\ell_k(m)$ for all $k$.
- Initialize orchestrator with uniform random reasoning decisions.
- Collect trajectories in ALFRED using the full hierarchical policy (orchestrator + executor + LLM modules).
- Compute rewards using $\mathcal{R} = \mathcal{R}_{\text{task}} - \lambda \cdot \mathcal{R}_{\text{resource}}$.
- Update $\pi_\phi$ via PPO with its clipped surrogate objective.
- Repeat for multiple epochs over collected batches.
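For reference, the clipped surrogate in the update step is the standard PPO form (with probability ratio $\rho_t$, advantage estimate $\hat{A}_t$, and clip range $\epsilon$; the paper does not report which exact variant it uses):

\[\mathcal{L}^{\text{CLIP}}(\phi) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\phi)\,\hat{A}_t,\ \operatorname{clip}\!\left(\rho_t(\phi),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad \rho_t(\phi) = \frac{\pi_\phi(a_t \mid o_t)}{\pi_{\phi_{\text{old}}}(a_t \mid o_t)}\]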
Key hyperparameters:
- The $\lambda$ parameter in the reward is swept to produce Pareto curves of success rate vs. latency.
- Discount factor $\gamma$ close to 1 since tasks are episodic with sparse terminal rewards.
- The LLM backbone used for reasoning modules is a standard large language model (e.g., GPT-class or LLaMA-class); the orchestrator itself is orders of magnitude smaller.
Convergence behavior: The orchestrator learns a non-trivial reasoning schedule — it invokes reasoning heavily during ambiguous or error-prone sub-steps (e.g., picking up a specific object from a cluttered counter) and skips reasoning during routine locomotion (e.g., navigating a clear hallway to the target room). This adaptive pattern emerges from the RL training without manual engineering.
evaluation
Benchmark: ALFRED validation split ($\sim$2,886 episodes). Evaluation reports standard ALFRED metrics: success rate (SR) and goal-condition success rate (GC).
Baselines compared:
- Always-reason: invoke LLM reasoning at every timestep (upper bound on reasoning quality, but also the highest latency).
- Never-reason: pure low-level policy with no LLM involvement (lower bound on reasoning quality, minimal latency).
- Heuristic rules: invoke reasoning when confidence drops below threshold $\tau$, or on a fixed schedule (every $k$ steps).
- Adaptive prior work: comparison with fixed-strategy LLM agents (SayCan-style).
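The heuristic baselines are simple enough to sketch directly. This is my assumed form (threshold $\tau$ and period $k$ values are placeholders; the paper does not publish the baseline implementations):

```python
def threshold_orchestrator(confidence, tau=0.6):
    """Invoke reasoning only when low-level policy confidence drops below tau."""
    return 1 if confidence < tau else 0

def periodic_orchestrator(t, k=5):
    """Invoke reasoning on a fixed schedule: every k-th timestep."""
    return 1 if t % k == 0 else 0
```

Both map a scalar signal to the binary $r_t$ but ignore the task context entirely, which is the gap the learned orchestrator is meant to close.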
Key results: RARRL consistently achieves higher task success rates than heuristic and fixed strategies at equivalent latency budgets. The RL-learned orchestrator finds a Pareto-optimal frontier:
- At low latency budgets (aggressive $\lambda$), RARRL retains $\sim$70-80% of the always-reason success rate while reducing LLM calls by $\sim$50-60%.
- At high latency budgets (relaxed $\lambda$), RARRL approaches the always-reason ceiling but with modest savings.
- Compared to heuristic threshold strategies at matched latency, RARRL outperforms by a significant margin in success rate because it learns when reasoning is most valuable rather than relying on proxy confidence signals.
Resource analysis: The learned orchestration patterns are interpretable. Reasoning is concentrated at decision points with high ambiguity: object selection from cluttered scenes, error recovery after failed grasps, and plan revision when unexpected obstacles are encountered. The orchestrator correctly learns that navigation through open spaces requires zero LLM involvement.
Robustness: RARRL is evaluated under varying latency profiles (simulating different hardware platforms) and maintains its advantage, demonstrating that the RL policy adapts to the specific resource constraints it was trained on.
reproduction guide
Prerequisites:
- Python 3.8+ with PyTorch, Stable Baselines3 (or equivalent PPO implementation)
- AI2-THOR simulator (version 5.0+) for ALFRED environment
- ALFRED benchmark dataset and evaluation code
- Access to an LLM API or local LLM serving (for the reasoning modules)
Step-by-step:
- Setup ALFRED environment:
  ```shell
  pip install ai2thor
  git clone https://github.com/AllenAI/alfred
  cd alfred && pip install -e .
  ```
- Profile LLM latency: before RL training, benchmark each reasoning role on your target hardware. Measure wall-clock time for each role at several budget levels, then fit or tabulate the latency functions $\ell_k(m)$.
- Train the orchestrator: initialize the orchestrator network and run PPO training on the ALFRED training split. The training loop alternates between environment interaction (collecting trajectories with the current orchestration policy) and policy updates.
- Sweep $\lambda$: train separate orchestrators for different values of $\lambda$ to trace out the Pareto frontier of success rate vs. latency.
- Evaluate: run the trained orchestrator on the ALFRED validation split. Report success rate, goal-condition success rate, total latency, and number of LLM invocations per episode.
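For the "fit or tabulate" part of the profiling step, a simple linear fit over profiled (budget, seconds) samples is one option. This is my sketch; the paper does not specify the functional form of $\ell_k(m)$, and the sample values below are invented.

```python
def fit_linear(samples):
    """Ordinary least squares for y = a*x + b over [(x, y), ...] pairs."""
    n = len(samples)
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Profiled samples for one role: (budget in tokens, measured seconds), made up
samples = [(64, 0.22), (128, 0.41), (256, 0.80)]
a, b = fit_linear(samples)
predict = lambda m: a * m + b   # ell_k(m) approximation for this role
```

A lookup table at the profiled budget levels works just as well if the orchestrator only ever emits those discrete budgets.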
Known challenges:
- No public code repo is available as of submission. Reproducing exactly requires re-implementing the orchestration framework.
- ALFRED environment setup can be finicky (specific AI2-THOR version, dataset download).
- The latency profiling step is hardware-dependent; results will vary across GPU/CPU/embedded platforms.
- RL training on ALFRED is computationally expensive due to the simulator overhead. Expect training to require significant GPU-hours.
Compute cost estimate: Training the orchestrator requires running many ALFRED episodes. Each episode involves simulator rendering plus periodic LLM calls. Expect training to take on the order of tens of GPU-hours, though the orchestrator network itself is small.
notes
Why this matters for bopi: The core idea — treating LLM reasoning as a sparse, expensive resource controlled by a lightweight RL policy — is directly applicable to embedded robotics where every token costs latency and battery. An orchestrator that learns to invoke reasoning only when the task demands it could enable running powerful LLMs on constrained hardware by using them sparingly. The key question this opens is: how small can the orchestrator be? Could it run on an MCU while delegating reasoning to a larger model over a low-bandwidth connection?
Connections to related work:
- Adaptive computation: Related to early exit networks and adaptive depth transformers that allocate computation proportional to input difficulty. RARRL extends this idea to the decision layer rather than the network layer.
- Hierarchical RL: Classic options framework where a high-level policy selects low-level skills. RARRL’s twist is that the high-level policy selects whether to invoke expensive reasoning, not just which skill to execute.
- LLM agents on robots: SayCan grounds LLMs in affordances, Inner Monologue adds environment feedback loops. RARRL adds the missing dimension: how much LLM to use per step.
Open questions:
- Transfer of the orchestrator across environments — does the learned “when to think” pattern transfer from kitchen tasks to novel scenes?
- Online adaptation — can the orchestrator adjust $\lambda$ at test time based on remaining battery?
- The latency model $\ell_k(m)$ is static (profiled before training). What if LLM latency varies dynamically (thermal throttling, concurrent workloads)?
- Scaling to more complex task suites (e.g., mobile manipulation, long-horizon multi-room tasks) where the reasoning landscape is richer.
- Combining RARRL’s orchestration with model compression — a smaller LLM invoked more often vs. a larger LLM invoked sparingly.
Weakest assumption: The paper assumes that the latency profile is known and stable. On real embedded hardware, LLM inference latency can vary significantly with thermal state, memory pressure, and power management. A more robust approach would learn to estimate the cost online rather than rely on a pre-profiled lookup table.
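One possible remedy (my sketch, not from the paper): keep the pre-profiled table as a prior and update it with an exponential moving average of realized latencies, so the cost model tracks thermal throttling or concurrent load at test time.

```python
class OnlineLatency:
    """EMA latency estimator per (role, budget), seeded from offline profiling."""

    def __init__(self, prior, alpha=0.1):
        self.est = dict(prior)   # (role, budget) -> estimated seconds
        self.alpha = alpha       # EMA weight on new observations

    def update(self, role, budget, observed_seconds):
        key = (role, budget)
        old = self.est.get(key, observed_seconds)
        self.est[key] = (1 - self.alpha) * old + self.alpha * observed_seconds

    def ell(self, role, budget):
        return self.est[(role, budget)]
```

The orchestrator's reward signal would then use `ell()` from this estimator instead of the static lookup, at the cost of a nonstationary reward during deployment.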