Reinforcement learning (RL) is a promising approach for robotic manipulation, but it can suffer from low sample efficiency and requires extensive exploration of large state-action spaces. Recent methods leverage the commonsense knowledge and reasoning abilities of large language models (LLMs) to guide exploration toward more meaningful states. However, LLMs can produce plans that are semantically plausible yet physically infeasible, yielding unreliable behavior. We introduce LLM-TALE, a framework that uses LLMs' planning to directly steer RL exploration. LLM-TALE integrates planning at both the task level and the affordance level, improving learning efficiency by directing agents toward semantically meaningful actions. Unlike prior approaches that assume optimal LLM-generated plans or rewards, LLM-TALE corrects suboptimality online and explores multimodal affordance-level plans without human supervision. We evaluate LLM-TALE on pick-and-place tasks in standard RL benchmarks, observing improvements in both sample efficiency and success rates over strong baselines. Real-robot experiments indicate promising zero-shot sim-to-real transfer.
We introduce LLM-guided Task- and Affordance-Level Exploration (LLM-TALE), a method that steers exploration toward semantically meaningful regions of the state space by leveraging LLM guidance that is more reliable at higher levels of abstraction (i.e., task and affordance). Because affordance-level actions are often multimodal and some modes are semantically valid yet physically infeasible, our method uses the critic’s value estimates (e.g., Q-values) to explore more promising modes and to avoid fruitless exploration of infeasible actions. LLM-TALE has the following features:
LLM-TALE (i) uses LLMs to generate task-level and affordance-level plans that direct RL exploration toward semantically meaningful regions, and (ii) enables multimodal affordance-level exploration using goal-conditioned value functions.
LLM-TALE converts high-level task instructions into executable plans through three prompts: a task-level prompt T, an affordance-identifier prompt M, and an affordance-planner prompt P. Given a high-level description L, T yields a reusable sequence of Python primitives p_{1:n} (e.g., robot.pick(object)) that decompose the task into actions such as “Pick up the cube” or “Place the box in the cupboard.” Affordance-level plans are semantically multimodal, and not every semantically valid description is physically feasible. The prompt M therefore enumerates distinct modes for each primitive p_j, producing language variants l^f_{j,1:m} such as “Pick the box from the side” or “Pick the box from the top.” The prompt P maps each description l^f_{j,i} to a natural-language affordance f_{j,i} that specifies the end-effector (EE) pose rather than precise SE(3) coordinates.
Detailed visualization of the task planner scheme, showing the structure of prompts T, M, and P.
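For concreteness, the Python sketch below mirrors the three-prompt flow described above. The call_llm helper, the prompt strings, and the AffordancePlan container are hypothetical placeholders rather than the released implementation; only the T → M → P structure follows the text.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an LLM call; any chat-completion client could back it.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

@dataclass
class AffordancePlan:
    primitive: str    # e.g., "robot.pick(box)"
    description: str  # e.g., "Pick the box from the top"
    affordance: str   # natural-language EE-pose specification

def plan_task(task_description: str) -> list[list[AffordancePlan]]:
    # Prompt T: decompose the task into a sequence of Python primitives p_{1:n}.
    primitives = call_llm(f"T: Decompose into primitives: {task_description}").splitlines()

    plans = []
    for primitive in primitives:
        # Prompt M: enumerate distinct semantic modes for this primitive, e.g.
        # "Pick the box from the side" / "Pick the box from the top".
        modes = call_llm(f"M: List distinct ways to execute: {primitive}").splitlines()

        # Prompt P: map each mode description to a natural-language affordance
        # that specifies the end-effector pose (not exact SE(3) coordinates).
        plans.append([
            AffordancePlan(
                primitive=primitive,
                description=mode,
                affordance=call_llm(f"P: Describe the EE pose for: {mode}"),
            )
            for mode in modes
        ])
    return plans
```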
Our online exploration module handles multimodal affordances by turning each affordance plan f_{j,1:m} into candidate goal poses g_{j,1:m} and selecting among them in a value-aware, uncertainty-guided way. For each primitive p_j, we score every candidate g_{j,i} using the value function (which estimates the expected return under the current observation) and an uncertainty factor, then sample a goal from the induced distribution. With a goal-conditioned state-value function, the selection probabilities are p_sel(i) ∝ exp(β V^π_φ(s, g_{j,i})) c_{j,i}, and an analogous form applies when using a Q-function Q_φ(s, a). Here, β > 0 controls how sharply high-value goals are favored, while the uncertainty score c_{j,i} manages the exploration–exploitation trade-off. After sampling i ∼ p_sel, we set g_j ← g_{j,i} and decay the chosen option’s uncertainty via c_{j,i} ← max((1−α) c_{j,i}, c_min), with α ∈ (0, 1) and c_min a minimum uncertainty value. This balances exploration and exploitation across the affordance modes.
LLM-TALE explores affordance modalities based on value V and uncertainty score c.
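The selection rule can be implemented in a few lines. The sketch below is a minimal NumPy version assuming the critic's value estimates for the current primitive's candidate goals are already available; the function name and default hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def select_affordance_mode(values, c, beta=5.0, alpha=0.1, c_min=0.05, rng=None):
    """Sample one affordance mode from p_sel(i) proportional to exp(beta * V_i) * c_i,
    then decay that mode's uncertainty: c_i <- max((1 - alpha) * c_i, c_min).

    values : array of value estimates V(s, g_{j,i}) for the candidate goals
    c      : array of per-mode uncertainty scores (updated in place)
    """
    rng = rng or np.random.default_rng()
    logits = beta * np.asarray(values, dtype=float)
    # Numerically stabilized softmax-like weights, scaled by the uncertainty scores.
    weights = np.exp(logits - logits.max()) * np.asarray(c, dtype=float)
    p_sel = weights / weights.sum()

    i = rng.choice(len(p_sel), p=p_sel)      # goal g_j <- g_{j,i}
    c[i] = max((1.0 - alpha) * c[i], c_min)  # reduce uncertainty of the chosen mode
    return i, p_sel
```

A rollout would call this once per primitive to obtain the goal g_j that conditions both the critic and the base controller.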
Our simulation evaluations span six tasks: three from RLBench and three from ManiSkill. The robot end-effector is controlled with relative position or velocity commands rather than joint-space control, which simplifies the action space for language models. We learn policies with a residual action space on top of base actions defined by pre-existing primitives. These primitives are implemented with a PD controller that moves linearly toward the target. The proportional gain is set to 1 by default, and motion is limited by maximum velocity and orientation-velocity bounds. For the pick primitive, the episode terminates when the object's displacement from its initial position exceeds a threshold. For the transport primitive, termination occurs when the positional error to the goal is below a threshold and the object is nearly static. We evaluate sample efficiency by comparing our method against three baselines and an ablation of LLM-TALE: Text2Reward (zero-shot and few-shot), RLPD with 25 high-quality demonstrations, and LLM-BC (an ablation of LLM-TALE in which the PD-control base policy is replaced by a behavior-cloning base policy). The results indicate high sample efficiency for both the TD3 and PPO variants of LLM-TALE, without requiring high-quality demonstration data.
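To illustrate the residual action space and the primitive termination rules described above, here is a minimal sketch; the class and function names, the thresholds, and the omission of orientation handling are assumptions for illustration, not the benchmark implementations.

```python
import numpy as np

class PDBasePrimitive:
    """Proportional controller that moves the EE linearly toward a target pose."""
    def __init__(self, kp=1.0, v_max=0.05, w_max=0.2):
        self.kp, self.v_max, self.w_max = kp, v_max, w_max

    def base_action(self, ee_pos, target_pos):
        # Proportional term, clipped to the maximum translational velocity.
        delta = self.kp * (target_pos - ee_pos)
        norm = np.linalg.norm(delta)
        if norm > self.v_max:
            delta *= self.v_max / norm
        return delta  # orientation handled analogously under the w_max bound

def step_action(primitive, policy_residual, ee_pos, target_pos):
    # Residual RL: the learned policy only corrects the scripted base action.
    return primitive.base_action(ee_pos, target_pos) + policy_residual

def pick_done(obj_pos, obj_init_pos, lift_thresh=0.02):
    # Pick terminates once the object has moved far enough from its initial pose.
    return np.linalg.norm(obj_pos - obj_init_pos) > lift_thresh

def transport_done(obj_pos, goal_pos, obj_vel, pos_thresh=0.01, vel_thresh=0.01):
    # Transport terminates when the object is at the goal and nearly static.
    return (np.linalg.norm(obj_pos - goal_pos) < pos_thresh
            and np.linalg.norm(obj_vel) < vel_thresh)
```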
Recordings of the policies trained with LLM-TALE.
To assess real-world performance, we evaluate zero-shot transfer of a simulation-trained policy on the ManiSkill PutBox task using a physical setup. The task requires stable picking and collision-free placement, making it well suited for evaluating the sim-to-real performance of LLM-TALE. We used a Franka Emika Panda with a Franka gripper, controlled by a Cartesian impedance controller that commands the end-effector pose at 1 kHz; our method sent end-effector pose commands to this controller at a lower rate. We performed real-time object tracking with a RealSense D435 RGB-D camera. We deployed a policy trained in simulation using LLM-TALE with TD3 and ran 15 episodes with our method and with an LLM-only controller that directly executes LLM-generated plans. Our method achieved a success rate of 93.3%, with a single failure.
Zero-shot sim-to-real experiment for the PutBox task.
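For reference, a minimal sketch of the deployment loop under the stated assumptions: the policy streams end-effector pose targets at a low rate (a placeholder 20 Hz here) while the 1 kHz Cartesian impedance controller tracks them; the tracker, controller, and policy interfaces are hypothetical placeholders, not an actual driver API.

```python
import time

POLICY_HZ = 20  # assumed command rate; much lower than the 1 kHz impedance loop

def deploy(policy, tracker, controller, goal, episode_len=200):
    """Stream EE pose targets from a sim-trained policy to the real robot.

    Placeholder interfaces:
    - tracker.get_object_pose(): object pose from the RGB-D tracking pipeline
    - controller.get_ee_pose() / controller.set_ee_target(pose): impedance controller I/O
    - policy.act(obs, goal): action from the LLM-TALE-trained policy
    """
    for _ in range(episode_len):
        t0 = time.time()
        obs = {
            "object_pose": tracker.get_object_pose(),
            "ee_pose": controller.get_ee_pose(),
        }
        # The high-rate impedance controller interpolates between these sparse targets.
        controller.set_ee_target(policy.act(obs, goal))
        time.sleep(max(0.0, 1.0 / POLICY_HZ - (time.time() - t0)))
```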