VLMPose

VLMs as Agents for Text-Guided Closed-Loop
Object 6D Pose Rearrangement

¹Seoul National University, ²RLWRLD

arXiv 2026

Abstract

Vision-Language Models (VLMs) exhibit strong visual reasoning capabilities, yet they still struggle with 3D understanding. In particular, VLMs often fail to infer a text-consistent goal 6D pose of a target object in a 3D scene. However, we find that with a few inference-time techniques and iterative reasoning, VLMs can achieve dramatic performance gains. Concretely, given a 3D scene represented by an RGB-D image (or a compositional scene of 3D meshes) and a text instruction specifying a desired state change, we repeat the following loop: observe the current scene; evaluate scene validity; propose a pose update for the target object; apply the update; and render the updated scene. Through this closed-loop interaction, the VLM effectively acts as an agent. We further introduce three inference-time techniques that are essential to this closed-loop process: (i) multi-view reasoning with supporting view selection, (ii) object-centric coordinate-system visualization, and (iii) single-axis rotation prediction. Without any additional fine-tuning or new modules, our approach surpasses prior methods at predicting the text-guided goal 6D pose of the target object, and it does so consistently across both closed-source and open-source VLMs. Moreover, combining our 6D pose prediction with simple robot motion planning enables more successful robot manipulation than existing methods. Finally, we conduct an ablation study to demonstrate the necessity of each proposed technique.
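To make the "propose a pose update; apply the update" steps concrete, here is a minimal Python sketch of one incremental update with a single-axis rotation. It assumes the object pose is stored as a 4x4 homogeneous matrix and that each update is a rotation about one axis of the object's own frame plus a translation in that same frame. The function names (`axis_rotation`, `apply_update`), the use of degrees, and the right-multiplication convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def axis_rotation(axis: str, angle_deg: float) -> np.ndarray:
    """Return a 3x3 rotation matrix about a single coordinate axis."""
    theta = np.radians(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    if axis == "x":
        return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    if axis == "y":
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    if axis == "z":
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    raise ValueError(f"axis must be 'x', 'y', or 'z', got {axis!r}")


def apply_update(pose: np.ndarray, axis: str, angle_deg: float,
                 translation) -> np.ndarray:
    """Apply one incremental update to a 4x4 object pose.

    Assumption: the update is expressed in the object's own frame,
    so it is composed by right-multiplication.
    """
    delta = np.eye(4)
    delta[:3, :3] = axis_rotation(axis, angle_deg)
    delta[:3, 3] = np.asarray(translation, dtype=float)
    return pose @ delta


# Example: rotate 90 degrees about the object's z-axis and shift 0.1 along its x-axis.
start = np.eye(4)
updated = apply_update(start, "z", 90.0, [0.1, 0.0, 0.0])
```

Restricting each step to one axis keeps every proposed update easy to describe in words (an axis name and an angle), and several such steps can be chained to reach an arbitrary orientation.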

Method Overview

(a) We first ask the evaluator to assess how well the rendered multi-view images satisfy the input text instruction. If the current scene is judged inconsistent with the instruction, we provide the proposer with the supporting view, augmented with coordinate system visualization, that best explains the evaluator’s judgment. Based on this input, the proposer predicts an incremental 6D pose update. The predicted update is then applied to the target object, and this process is repeated iteratively. Throughout all iterations, both roles receive the accumulated context memory.

(b) Example of iterative pose updates produced by our method for a scene containing a teacup and a teapot on a table, given the text instruction "Pour tea into a teacup using a teapot."
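The evaluator and proposer interaction described in (a) can be sketched as a small control loop. This is only a skeleton under stated assumptions: the callables `render`, `evaluate`, `propose`, and `apply_update`, the string-based memory, and the `max_iters` stopping rule are all hypothetical placeholders, and the toy one-dimensional "scene" at the bottom stands in for real rendering and real VLM calls.

```python
from typing import Any, Callable, List, Tuple


def closed_loop(
    pose: Any,
    instruction: str,
    render: Callable[[Any], List[Any]],
    evaluate: Callable[[List[Any], str, List[str]], Tuple[bool, int, str]],
    propose: Callable[[Any, str, List[str]], Any],
    apply_update: Callable[[Any, Any], Any],
    max_iters: int = 10,
) -> Tuple[Any, List[str]]:
    """Run the evaluate -> propose -> apply -> render loop with shared memory."""
    memory: List[str] = []  # accumulated context passed to both roles
    for step in range(max_iters):
        views = render(pose)  # multi-view observation of the current scene
        satisfied, support_idx, reason = evaluate(views, instruction, memory)
        memory.append(f"step {step}: evaluator: {reason}")
        if satisfied:
            break
        # The proposer only sees the supporting view chosen by the evaluator.
        update = propose(views[support_idx], instruction, memory)
        memory.append(f"step {step}: proposer: {update}")
        pose = apply_update(pose, update)
    return pose, memory


# Toy stand-ins: the "pose" is a single number and the goal is to reach 1.0.
final_pose, log = closed_loop(
    pose=0.0,
    instruction="move the object to 1.0",
    render=lambda p: [p],
    evaluate=lambda views, text, mem: (
        abs(views[0] - 1.0) < 0.05, 0, f"offset {1.0 - views[0]:.3f}"
    ),
    propose=lambda view, text, mem: 0.5 * (1.0 - view),
    apply_update=lambda p, u: p + u,
)
```

In a real setup, `evaluate` and `propose` would wrap VLM queries and `render` would produce images of the scene; the loop structure itself stays the same.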

Results

Object 6D Pose Iterative Refinement

Robot Manipulation in Simulation with Our Goal 6D Pose Predictions

Comparison with Other Methods on the Robot Manipulation Task

BibTeX

@article{vlmpose,
  title={VLMs as Agents for Text-Guided Closed-Loop Object 6D Pose Rearrangement},
  author={Baik, Sangwon and Kim, Gunhee and Choi, Mingi and Joo, Hanbyul},
  journal={arXiv:2511.20446},
  year={2026}
}

TL;DR: Given a 3D scene, our goal is to rearrange the 6D pose of a target object so that the scene satisfies a given text instruction.
