Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Sangwon Beak, Hyeonwoo Kim, Hanbyul Joo
Seoul National University
[Teaser figure]

Given a textual description of the spatial relationship between two objects, our method models their OOR, i.e., their relative poses and scales consistent with the text. We obtain OOR samples using off-the-shelf models and a proposed mesh registration method, then learn their distribution with a diffusion model. During inference, our OOR diffusion model generates OOR samples conditioned on the text input.
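For concreteness, a pairwise OOR sample can be viewed as the relative rotation, translation, and scale of the target object expressed in the base object's coordinate frame. The minimal sketch below illustrates such a container; the field names and the 4x4 similarity-transform composition are illustrative assumptions, not the paper's exact parameterization.

from dataclasses import dataclass
import numpy as np

@dataclass
class PairwiseOOR:
    """Illustrative container for one OOR sample: the target object's
    pose and scale expressed in the base object's coordinate frame."""
    rotation: np.ndarray     # (3, 3) relative rotation, base -> target
    translation: np.ndarray  # (3,) relative translation
    scale: float             # relative scale of the target w.r.t. the base

    def as_matrix(self) -> np.ndarray:
        """Compose the components into a 4x4 similarity transform."""
        T = np.eye(4)
        T[:3, :3] = self.scale * self.rotation
        T[:3, 3] = self.translation
        return T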

Abstract

We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, offering an efficient way to collect 3D data for learning OOR across an unbounded set of object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging this diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, as well as its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.
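As a rough illustration of what consistency across pairwise relations can mean (not the paper's actual formulation), the pairwise similarity transforms among three objects should approximately compose; a simple residual quantifying this is sketched below, with the function name and the Frobenius-norm choice being our assumptions.

import numpy as np

def composition_residual(T_ab, T_bc, T_ac):
    """Consistency check for three pairwise relations, each a 4x4
    similarity transform: T_xy maps points from object y's frame into
    object x's frame. If the relations are mutually consistent,
    T_ab @ T_bc should approximately equal T_ac, so the Frobenius norm
    of their difference serves as a simple residual."""
    return float(np.linalg.norm(T_ab @ T_bc - T_ac, ord="fro"))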

Method Overview

[Method figure: 3D OOR sample collection]
For a given text prompt describing an object pair, we obtain multi-view images and point clouds using off-the-shelf models, and lift pixel features into 3D point features. We repeat the same process for rendered images of the collected meshes. Finally, we perform Procrustes analysis with RANSAC to estimate the relative pose and scale of each object.
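A minimal sketch of this registration step, assuming putative 3D-3D correspondences obtained by matching the point features: a closed-form similarity Procrustes solve (Umeyama-style) wrapped in a RANSAC loop. Function names, the inlier threshold, and the iteration count are illustrative choices, not the paper's implementation.

import numpy as np

def similarity_procrustes(src, dst):
    """Closed-form scale, rotation, translation (Umeyama) mapping src -> dst.
    src, dst: (N, 3) corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                    # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def ransac_procrustes(src, dst, iters=1000, thresh=0.05, seed=0):
    """Robustly estimate the similarity transform: sample minimal sets of
    3 correspondences, keep the largest inlier set, then refit on it."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = similarity_procrustes(src[idx], dst[idx])
        resid = np.linalg.norm(dst - (s * (R @ src.T).T + t), axis=1)
        inliers = resid < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return similarity_procrustes(src[best_inliers], dst[best_inliers])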


[Method figure: OOR diffusion model]
Our OOR diffusion model generates OOR samples by taking the context \( \mathbf{c} \), base object category \( \mathcal{B} \), and target object category \( \mathcal{T} \) as text conditions.
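As a schematic of how such conditional generation can look in code, the sketch below runs DDPM-style ancestral sampling of a low-dimensional OOR vector given a text embedding of \( \mathbf{c} \), \( \mathcal{B} \), and \( \mathcal{T} \). The denoiser interface, noise schedule, and vector dimensionality are placeholders and do not reflect the released model.

import torch

@torch.no_grad()
def sample_oor(denoiser, text_emb, dim=9, steps=1000, device="cpu"):
    """DDPM-style ancestral sampling of an OOR vector (e.g., relative
    rotation/translation/scale parameters) conditioned on a text embedding.
    denoiser(x_t, t, text_emb) is assumed to predict the added noise."""
    betas = torch.linspace(1e-4, 2e-2, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, dim, device=device)      # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, text_emb)    # text-conditioned noise prediction
        # posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # decoded downstream into a relative pose and scale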

Video Presentation

Results

Denoising Process of Pairwise OOR Generation

Denoising Process of Multi-object OOR Generation

Scene Editing

BibTeX

@misc{oor,
      title={Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models}, 
      author={Sangwon Beak and Hyeonwoo Kim and Hanbyul Joo},
      year={2025},
      eprint={2503.19914},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19914}, 
}