Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Seoul National University
* Equal Contribution     † Corresponding Author
NeurIPS 2025
Teaser Image

Given a textual description of a Human–Human–Object Interaction (HHOI) involving interactions among multiple people and between people and an object, our model generates the body poses of all humans as well as their global orientations and translations with respect to the given object.

Abstract

The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediate step, we extract individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and use these data to train a text-to-HOI model and a text-to-HHI model with score-based diffusion. Finally, we present a unified generative framework that integrates the two individual models and synthesizes complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Dataset Capture

Method Image
To address the lack of diverse HHOI data, we introduce a newly collected 3D dataset captured using a multi-camera system, specifically designed to support the training and evaluation of HHOI models. Object poses are tracked with ArUco markers, and human poses are estimated with DWPose.


Method Overview

Method Image
We split the captured HHOI dataset into an HOI dataset and an HHI dataset, and train separate HOI and HHI diffusion models on each subset. During inference, our proposed advanced HHOI sampling strategy jointly samples from both the HOI and HHI diffusion models, enabling the generation of interactions involving multiple humans.
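The idea of jointly sampling from two separately trained score models can be illustrated with a toy sketch. This is not the authors' implementation: the learned HOI and HHI diffusion models are replaced by analytic scores of 1-D Gaussians, and the joint sample is drawn by summing the two scores inside a Langevin loop, as in compositional score-based sampling. All function names here are hypothetical.

```python
import numpy as np

def hoi_score(x, mu=1.0, sigma=1.0):
    # Stand-in for the learned HOI model: score of N(mu, sigma^2),
    # i.e. d/dx log p(x) = -(x - mu) / sigma^2.
    return -(x - mu) / sigma**2

def hhi_score(x, mu=-1.0, sigma=1.0):
    # Stand-in for the learned HHI model.
    return -(x - mu) / sigma**2

def joint_langevin_sample(n_steps=2000, step=1e-2, seed=0):
    # Unadjusted Langevin dynamics on the *composed* score:
    # summing the two scores targets the (normalized) product of
    # the two densities, here proportional to N(0, 1/2).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal()
    for _ in range(n_steps):
        g = hoi_score(x) + hhi_score(x)  # composed score
        x = x + step * g + np.sqrt(2.0 * step) * rng.standard_normal()
    return x

samples = np.array([joint_langevin_sample(seed=s) for s in range(200)])
print(samples.mean(), samples.std())  # mean near 0, std near sqrt(1/2)
```

In the actual framework, each score function would be a conditional diffusion network evaluated on the relevant subset of humans and the object, and the combined update drives all poses toward configurations consistent with both models at once.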

Video Presentation

Results

Comparison with Competitors on HHOI Generation

HHOI Generation with Multiple Humans

Application: Human Motion In-Betweening

BibTeX

@inproceedings{hhoi,
  title={Learning to Generate Human-Human-Object Interactions from Textual Descriptions},
  author={Na, Jeonghyeon and Baik, Sangwon and Lee, Inhee and Lee, Junyoung and Joo, Hanbyul},
  booktitle={NeurIPS},
  year={2025}
}