Efficient Robotic Policy Learning via
Latent Space Backward Planning


Anonymous Authors


Abstract

Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: Can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Latent space Backward Planning scheme (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance.


Figure 1. Illustration of latent space backward planning.


Figure 2. Overall framework architecture of LBP.

Real world experiments


Experimental setup and task definitions

To investigate the effectiveness of LBP in the real world, we design four long-horizon tasks: Stack 3 cups, Move cups, Stack 4 cups, and Shift cups. Each task is decomposed into multiple sequential stages, as shown in Figure 3, requiring the robot to perform fundamental pick-and-place operations. These tasks establish a critical dependency: progress in later stages is contingent on successful execution of the preceding ones. We assess task performance using a stage-based scoring system with discrete values {0, 25, 50, 75, 100} per stage, where each score corresponds to the completion progress of that stage. A stage is assigned 100 only upon its full completion. Figure 4 presents the quantitative comparison on the real-world tasks.
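The stage-based scoring above can be sketched as a small helper. This is a hypothetical illustration, not the paper's evaluation code: the function names and the human-judged `progress` fraction are assumptions.

```python
# Hypothetical sketch of the stage-based scoring described above.
# Each stage receives a discrete score in {0, 25, 50, 75, 100};
# 100 is awarded only when the whole stage is completed.

STAGE_SCORES = (0, 25, 50, 75, 100)

def score_stage(progress: float, completed: bool) -> int:
    """Map partial progress in [0, 1] to the nearest discrete score.

    `progress` and `completed` are assumed to be judged by a human
    evaluator; a stage earns 100 only when `completed` is True.
    """
    if completed:
        return 100
    # Snap partial progress to the closest sub-100 bucket.
    return min(STAGE_SCORES[:-1], key=lambda s: abs(s - progress * 100))

def average_score(stage_scores: list) -> float:
    """The "Avg. Score" metric: mean score over the stages of a task."""
    return sum(stage_scores) / len(stage_scores)
```

For example, a stage judged 60% complete but not finished would score 50, and a task with stage scores [100, 50] would report an Avg. Score of 75.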



Figure 3. Left: the desktop environment for the real-world experiments, containing a 6-DoF AIRBOT arm and three Logitech cameras with different views. Right: (1) Move cups: move both brown cups in front of the white ones; (2) Stack cups: stack all paper cups together; (3) Shift cups: shift all the paper cups to another plate in a clockwise direction.



Figure 4. Real-world main results. We evaluate LCBC, GLCBC, SuSIE, and LBP on the four aforementioned tasks. For each task, we report the average performance of the last-3 checkpoints. The metric "Avg. Score" measures the average score per stage. While LBP only slightly outperforms the other strong baselines in the early stages, it wins by a fairly large margin in the final stages of all tasks, showing that LBP significantly excels at long-horizon tasks.

Comparison to forward planning

To verify the effectiveness of the backward planning approach, we train conventional forward planners \( f(z_{t+k}\mid z_t,\phi_l) \) and \( f(z_{t+nk}\mid z_t,z_{t+(n-1)k},\phi_l),\ n=2,\cdots \) with the same network architecture for a fair comparison. While our planners progressively predict subgoals \( z_{g}, w_1, w_2, \dots, w_n \) in a backward manner, the forward planners predict one subgoal \( k=10 \) steps ahead at a time, sequentially generating latent subgoals \( z_{10}, z_{20}, \dots, z_{10n} \). We randomly sample one trajectory from each real-world AIRBOT task and compute the Mean Squared Error (MSE) between the predicted subgoals and their corresponding ground truths for both planners, as presented in Figure 5.
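The two prediction schemes can be contrasted with a minimal sketch, assuming each planner is an arbitrary learned callable over latent states. `forward_rollout`, `backward_rollout`, and the toy planners in the usage note are hypothetical stand-ins, not the paper's actual networks.

```python
# Minimal sketch contrasting forward and backward subgoal prediction.
# Both "planners" are assumed to be learned functions over latents;
# the names and signatures here are illustrative only.

def forward_rollout(forward_planner, z_t, phi_l, n):
    """Predict subgoals k steps apart, each conditioned on the last:
    z_{t+k}, z_{t+2k}, ..., z_{t+nk}. Errors compound along the chain."""
    subgoals, z_prev = [], z_t
    for _ in range(n):
        z_prev = forward_planner(z_prev, phi_l)
        subgoals.append(z_prev)
    return subgoals  # ordered from near to far

def backward_rollout(backward_planner, z_t, z_g, n):
    """Start from the grounded final goal z_g and recursively predict
    intermediate subgoals w_1, ..., w_n moving closer to z_t."""
    subgoals, w = [z_g], z_g
    for _ in range(n):
        w = backward_planner(z_t, w)  # next subgoal between z_t and w
        subgoals.append(w)
    return subgoals  # ordered from far (goal) to near
```

With a toy backward planner that bisects between the current latent and the previous subgoal, `backward_rollout(lambda z, w: (z + w) / 2, 0.0, 8.0, 2)` yields [8.0, 4.0, 2.0]: each prediction lands closer to the current state, while the grounded goal anchors the whole chain.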



Figure 5. Mean Squared Errors (MSE) between predicted subgoals and corresponding ground truths in forward and backward planning.


It can be observed that the compounding errors of forward planning grow rapidly across all tasks; on the most challenging task, Shift cups, the error becomes prohibitively large when predicting distant subgoals. Even more concerning, some existing methods attempt to predict continuous frames of future images, which would exacerbate these issues further. In contrast, our method maintains an extremely low error throughout the entire planning horizon. These results highlight the advantage of the backward planning strategy, which enables accurate subgoal prediction with only a few subgoals across the entire trajectory: LBP is efficient at prediction and effectively reduces compounding errors, whereas forward planning requires many predictions and suffers from significant accumulated error.
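The per-subgoal error reported in Figure 5 amounts to the following computation, sketched here with generic NumPy arrays; the array shapes and function name are assumptions for illustration.

```python
import numpy as np

def subgoal_mse(predicted, ground_truth):
    """Per-step MSE between predicted latent subgoals and the
    ground-truth latents at the matching timesteps.

    Both inputs are assumed to have shape (num_subgoals, latent_dim);
    the result is one scalar error per subgoal along the horizon.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return ((predicted - ground_truth) ** 2).mean(axis=1)
```

Plotting these per-step errors against the prediction index is what reveals the compounding-error growth of forward planning versus the flat error curve of backward planning.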



Raw videos of long-horizon task performance

Stack 3 cups



Stack 4 cups



Move cups



Shift cups



LIBERO-LONG benchmark


LIBERO-LONG consists of 10 distinct long-horizon robotic manipulation tasks that require diverse skills such as picking up objects, turning on a stove, and closing a microwave. These tasks involve multi-stage decision-making and span a variety of scenarios, making them particularly challenging. Table 1 presents the quantitative comparison on the LIBERO-LONG benchmark: LBP outperforms all baselines, achieving higher success rates on the majority of tasks.

Table 1. LIBERO-LONG results. For each task, we present the average performance of the top-3 checkpoints. The metric "Avg. Success" measures the average success rate across the 10 tasks. LBP outperforms the baselines with a higher Avg. Success and better results on most tasks. The best results are bolded.



Task 1: put soup and sauce in basket

Task 2: put box and butter in basket

Task 3: turn on stove and put pot




Task 4: put bowl in drawer and close it

Task 5: put mugs on left and right plates

Task 6: pick book and place it in back



Task 7: put mug on plate and put pudding to right

Task 8: put soup and box in basket


Task 9: put both pots on stove




Task 10: put mug in microwave and close it