Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

Sangwon Beak, Hyeonwoo Kim, Hanbyul Joo
Seoul National University
[Teaser figure]

Given a textual description of the spatial relationship between two objects, our method models their OOR, i.e., their relative poses and scales consistent with the text. We obtain OOR samples using off-the-shelf models and a proposed mesh registration method, then learn their distribution with a diffusion model. During inference, our OOR diffusion model generates OOR samples conditioned on the text input.
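For concreteness, a pairwise OOR sample can be viewed as the relative rotation, translation, and scale of the target object expressed in the base object's coordinate frame. The minimal sketch below illustrates such a container; the field names and the 4x4 similarity-transform composition are illustrative assumptions, not the paper's exact parameterization.

from dataclasses import dataclass
import numpy as np

@dataclass
class PairwiseOOR:
    """Illustrative container for one OOR sample: the target object's
    pose and scale expressed in the base object's coordinate frame."""
    rotation: np.ndarray     # (3, 3) relative rotation, base -> target
    translation: np.ndarray  # (3,) relative translation
    scale: float             # relative scale of the target w.r.t. the base

    def as_matrix(self) -> np.ndarray:
        """Compose the components into a 4x4 similarity transform."""
        T = np.eye(4)
        T[:3, :3] = self.scale * self.rotation
        T[:3, 3] = self.translation
        return T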

Abstract

We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, offering an efficient way to collect 3D data for learning OOR across an unbounded set of object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging this diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, as well as its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.
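As a rough illustration of what consistency across pairwise relations can mean (not the paper's actual formulation), the pairwise similarity transforms among three objects should approximately compose; a simple residual quantifying this is sketched below, with the function name and the Frobenius-norm choice being our assumptions.

import numpy as np

def composition_residual(T_ab, T_bc, T_ac):
    """Consistency check for three pairwise relations, each a 4x4
    similarity transform: T_xy maps points from object y's frame into
    object x's frame. If the relations are mutually consistent,
    T_ab @ T_bc should approximately equal T_ac, so the Frobenius norm
    of their difference serves as a simple residual."""
    return float(np.linalg.norm(T_ab @ T_bc - T_ac, ord="fro"))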

Method Overview

[Method figure: 3D OOR sample collection]
For a given text prompt describing an object pair, we obtain multi-view images and point clouds using off-the-shelf models, and lift pixel features into 3D point features. We repeat the same process for rendered images of the collected meshes. Finally, we perform Procrustes analysis with RANSAC to estimate the relative pose and scale of each object.
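A minimal sketch of this registration step, assuming putative 3D-3D correspondences obtained by matching the point features: a closed-form similarity Procrustes solve (Umeyama-style) wrapped in a RANSAC loop. Function names, the inlier threshold, and the iteration count are illustrative choices, not the paper's implementation.

import numpy as np

def similarity_procrustes(src, dst):
    """Closed-form scale, rotation, translation (Umeyama) mapping src -> dst.
    src, dst: (N, 3) corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                    # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

def ransac_procrustes(src, dst, iters=1000, thresh=0.05, seed=0):
    """Robustly estimate the similarity transform: sample minimal sets of
    3 correspondences, keep the largest inlier set, then refit on it."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        s, R, t = similarity_procrustes(src[idx], dst[idx])
        resid = np.linalg.norm(dst - (s * (R @ src.T).T + t), axis=1)
        inliers = resid < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return similarity_procrustes(src[best_inliers], dst[best_inliers])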


[Method figure: OOR diffusion model]
Our OOR diffusion model generates OOR samples by taking the context \( \mathbf{c} \), base object category \( \mathcal{B} \), and target object category \( \mathcal{T} \) as text conditions.
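As a schematic of how such conditional generation can look in code, the sketch below runs DDPM-style ancestral sampling of a low-dimensional OOR vector given a text embedding of \( \mathbf{c} \), \( \mathcal{B} \), and \( \mathcal{T} \). The denoiser interface, noise schedule, and vector dimensionality are placeholders and do not reflect the released model.

import torch

@torch.no_grad()
def sample_oor(denoiser, text_emb, dim=9, steps=1000, device="cpu"):
    """DDPM-style ancestral sampling of an OOR vector (e.g., relative
    rotation/translation/scale parameters) conditioned on a text embedding.
    denoiser(x_t, t, text_emb) is assumed to predict the added noise."""
    betas = torch.linspace(1e-4, 2e-2, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, dim, device=device)      # start from pure noise
    for t in reversed(range(steps)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        eps = denoiser(x, t_batch, text_emb)    # text-conditioned noise prediction
        # posterior mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # decoded downstream into a relative pose and scale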

Video Presentation

Results

Denoising Process of Pairwise OOR Generation

Denoising Process of Multi-object OOR Generation

Scene Editing

BibTeX

@misc{oor,
      title={Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models}, 
      author={Sangwon Beak and Hyeonwoo Kim and Hanbyul Joo},
      year={2025},
      eprint={2503.19914},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19914}, 
}