Reinforcement learning (RL) is a promising approach for robotic manipulation, but it can suffer from low sample efficiency and requires extensive exploration of large state-action spaces. Recent methods leverage the commonsense knowledge and reasoning abilities of large language models (LLMs) to guide exploration toward more meaningful states. However, LLMs can produce plans that are semantically plausible yet physically infeasible, yielding unreliable behavior. We introduce LLM-TALE, a framework that uses LLMs' planning to directly steer RL exploration. LLM-TALE integrates planning at both the task level and the affordance level, improving learning efficiency by directing agents toward semantically meaningful actions. Unlike prior approaches that assume optimal LLM-generated plans or rewards, LLM-TALE corrects suboptimality online and explores multimodal affordance-level plans without human supervision. We evaluate LLM-TALE on pick-and-place tasks in standard RL benchmarks, observing improvements in both sample efficiency and success rates over strong baselines. Real-robot experiments indicate promising zero-shot sim-to-real transfer.
We introduce LLM-guided Task- and Affordance-Level Exploration (LLM-TALE), a method that steers exploration toward semantically meaningful regions of the state space by leveraging LLM guidance that is more reliable at higher levels of abstraction (i.e., task and affordance). Because affordance-level actions are often multimodal and some modes are semantically valid yet physically infeasible, our method uses the critic’s value estimates (e.g., Q-values) to explore more promising modes and to avoid fruitless exploration of infeasible actions. LLM-TALE has the following features:
LLM-TALE (i) uses LLMs to generate task-level and affordance-level plans that direct RL exploration toward semantically meaningful regions, and (ii) enables multimodal affordance-level exploration using goal-conditioned value functions.
LLM-TALE converts high-level task instructions into executable plans through three prompts: a task-level prompt T, an affordance-identifier prompt M, and an affordance-planner prompt P. Given a high-level description L, T yields a reusable sequence of Python primitives p_{1:n} (e.g., robot.pick(object)) that decompose the task into actions such as “Pick up the cube” or “Place the box in the cupboard.” Affordance-level plans are semantically multimodal, and not every semantically valid description is physically feasible. The prompt M therefore enumerates distinct modes for each primitive p_j, producing language variants l^f_{j,1:m} such as “Pick the box from the side” or “Pick the box from the top.” The prompt P maps each description l^f_{j,i} to a natural-language affordance f_{j,i} that specifies the end-effector (EE) pose rather than precise SE(3) coordinates.
Detailed visualization of the task planner scheme, showing the structure of prompts T, M, and P.
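For concreteness, the Python sketch below mirrors the three-prompt flow described above. The call_llm helper, the prompt strings, and the AffordancePlan container are hypothetical placeholders rather than the released implementation; only the T → M → P structure follows the text.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an LLM call; any chat-completion client could back it.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

@dataclass
class AffordancePlan:
    primitive: str    # e.g., "robot.pick(box)"
    description: str  # e.g., "Pick the box from the top"
    affordance: str   # natural-language EE-pose specification

def plan_task(task_description: str) -> list[list[AffordancePlan]]:
    # Prompt T: decompose the task into a sequence of Python primitives p_{1:n}.
    primitives = call_llm(f"T: Decompose into primitives: {task_description}").splitlines()

    plans = []
    for primitive in primitives:
        # Prompt M: enumerate distinct semantic modes for this primitive, e.g.
        # "Pick the box from the side" / "Pick the box from the top".
        modes = call_llm(f"M: List distinct ways to execute: {primitive}").splitlines()

        # Prompt P: map each mode description to a natural-language affordance
        # that specifies the end-effector pose (not exact SE(3) coordinates).
        plans.append([
            AffordancePlan(
                primitive=primitive,
                description=mode,
                affordance=call_llm(f"P: Describe the EE pose for: {mode}"),
            )
            for mode in modes
        ])
    return plans
```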
Our online exploration module handles multimodal affordances by turning each affordance plan f_{j,1:m} into candidate goal poses g_{j,1:m} and selecting among them in a value-aware, uncertainty-guided way. For each primitive p_j, we score every candidate g_{j,i} using the value function (which estimates the expected return under the current observation) and an uncertainty factor, then sample a goal from the induced distribution. With a goal-conditioned state-value function, the selection probabilities are p_sel(i) ∝ exp(β V^π_φ(s, g_{j,i})) c_{j,i}, and an analogous form applies when using a Q-function Q_φ(s, a). Here, β > 0 controls how sharply high-value goals are favored, while the uncertainty score c_{j,i} manages the exploration–exploitation trade-off. After sampling i ∼ p_sel, we set g_j ← g_{j,i} and decay the chosen option’s uncertainty via c_{j,i} ← max((1−α) c_{j,i}, c_min), with α ∈ (0, 1) and c_min a minimum uncertainty value. This balances exploration and exploitation across the affordance modes.
LLM-TALE explores affordance modalities based on value V and uncertainty score c.
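The selection rule can be implemented in a few lines. The sketch below is a minimal NumPy version assuming the critic's value estimates for the current primitive's candidate goals are already available; the function name and default hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def select_affordance_mode(values, c, beta=5.0, alpha=0.1, c_min=0.05, rng=None):
    """Sample one affordance mode from p_sel(i) proportional to exp(beta * V_i) * c_i,
    then decay that mode's uncertainty: c_i <- max((1 - alpha) * c_i, c_min).

    values : array of value estimates V(s, g_{j,i}) for the candidate goals
    c      : array of per-mode uncertainty scores (updated in place)
    """
    rng = rng or np.random.default_rng()
    logits = beta * np.asarray(values, dtype=float)
    # Numerically stabilized softmax-like weights, scaled by the uncertainty scores.
    weights = np.exp(logits - logits.max()) * np.asarray(c, dtype=float)
    p_sel = weights / weights.sum()

    i = rng.choice(len(p_sel), p=p_sel)      # goal g_j <- g_{j,i}
    c[i] = max((1.0 - alpha) * c[i], c_min)  # reduce uncertainty of the chosen mode
    return i, p_sel
```

A rollout would call this once per primitive to obtain the goal g_j that conditions both the critic and the base controller.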
Our simulation evaluations span six tasks: three from RLBench and three from ManiSkill. The robot end-effector is controlled with relative position or velocity commands rather than joint-space control, which simplifies the action space for language models. We learn policies with a residual action space on top of base actions defined by pre-existing primitives. These primitives are implemented with a PD controller that moves linearly toward the target. The proportional gain is set to 1 by default, and motion is limited by maximum velocity and orientation-velocity bounds. For the pick primitive, the episode terminates when the object's displacement from its initial position exceeds a threshold. For the transport primitive, termination occurs when the positional error to the goal is below a threshold and the object is nearly static. We evaluate sample efficiency by comparing our method against three baselines and an ablation of LLM-TALE: Text2Reward (zero-shot and few-shot), RLPD with 25 high-quality demonstrations, and LLM-BC (an ablation of LLM-TALE in which the PD-control base policy is replaced by a behavior-cloning base policy). The results indicate high sample efficiency for both the TD3 and PPO variants of LLM-TALE, without requiring high-quality demonstration data.
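To illustrate the residual action space and the primitive termination rules described above, here is a minimal sketch; the class and function names, the thresholds, and the omission of orientation handling are assumptions for illustration, not the benchmark implementations.

```python
import numpy as np

class PDBasePrimitive:
    """Proportional controller that moves the EE linearly toward a target pose."""
    def __init__(self, kp=1.0, v_max=0.05, w_max=0.2):
        self.kp, self.v_max, self.w_max = kp, v_max, w_max

    def base_action(self, ee_pos, target_pos):
        # Proportional term, clipped to the maximum translational velocity.
        delta = self.kp * (target_pos - ee_pos)
        norm = np.linalg.norm(delta)
        if norm > self.v_max:
            delta *= self.v_max / norm
        return delta  # orientation handled analogously under the w_max bound

def step_action(primitive, policy_residual, ee_pos, target_pos):
    # Residual RL: the learned policy only corrects the scripted base action.
    return primitive.base_action(ee_pos, target_pos) + policy_residual

def pick_done(obj_pos, obj_init_pos, lift_thresh=0.02):
    # Pick terminates once the object has moved far enough from its initial pose.
    return np.linalg.norm(obj_pos - obj_init_pos) > lift_thresh

def transport_done(obj_pos, goal_pos, obj_vel, pos_thresh=0.01, vel_thresh=0.01):
    # Transport terminates when the object is at the goal and nearly static.
    return (np.linalg.norm(obj_pos - goal_pos) < pos_thresh
            and np.linalg.norm(obj_vel) < vel_thresh)
```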
Recordings of the policies trained with LLM-TALE.
To assess real-world performance, we evaluate zero-shot transfer of a simulation-trained policy on the ManiSkill PutBox task using a physical setup. The task requires stable picking and collision-free placement, making it well suited for evaluating the sim-to-real performance of LLM-TALE. We used a Franka Emika Panda with a Franka gripper, controlled by a Cartesian impedance controller that commands the end-effector pose at 1 kHz; our method sent end-effector pose commands to this controller at a lower rate. We performed real-time object tracking with a RealSense D435 RGB-D camera. We deployed a policy trained in simulation using LLM-TALE with TD3 and ran 15 episodes with our method and with an LLM-only controller that directly executes LLM-generated plans. Our method achieved a success rate of 93.3%, with a single failure.
Zero-shot sim-to-real experiment for the PutBox task.
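For reference, a minimal sketch of the deployment loop under the stated assumptions: the policy streams end-effector pose targets at a low rate (a placeholder 20 Hz here) while the 1 kHz Cartesian impedance controller tracks them; the tracker, controller, and policy interfaces are hypothetical placeholders, not an actual driver API.

```python
import time

POLICY_HZ = 20  # assumed command rate; much lower than the 1 kHz impedance loop

def deploy(policy, tracker, controller, goal, episode_len=200):
    """Stream EE pose targets from a sim-trained policy to the real robot.

    Placeholder interfaces:
    - tracker.get_object_pose(): object pose from the RGB-D tracking pipeline
    - controller.get_ee_pose() / controller.set_ee_target(pose): impedance controller I/O
    - policy.act(obs, goal): action from the LLM-TALE-trained policy
    """
    for _ in range(episode_len):
        t0 = time.time()
        obs = {
            "object_pose": tracker.get_object_pose(),
            "ee_pose": controller.get_ee_pose(),
        }
        # The high-rate impedance controller interpolates between these sparse targets.
        controller.set_ee_target(policy.act(obs, goal))
        time.sleep(max(0.0, 1.0 / POLICY_HZ - (time.time() - t0)))
```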