Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Seoul National University
* Equal Contribution     † Corresponding Author
NeurIPS 2025
Teaser Image

Given a textual description of a Human–Human–Object Interaction (HHOI) involving interactions among multiple people and between people and an object, our model generates the body poses of all humans as well as their global orientations and translations with respect to the given object.

Abstract

The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediate step, we extract individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and use these data to train a text-to-HOI model and a text-to-HHI model with score-based diffusion. Finally, we present a unified generative framework that integrates the two individual models and synthesizes complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Dataset Capture

Method Image
To address the lack of diverse HHOI data, we introduce a newly collected 3D dataset captured using a multi-camera system, specifically designed to support the training and evaluation of HHOI models. Object poses are tracked with ArUco markers, and human poses are estimated with DWPose.


Method Overview

Method Image
We split the captured HHOI dataset into an HOI dataset and an HHI dataset, and train separate HOI and HHI diffusion models on each subset. During inference, our proposed advanced HHOI sampling strategy jointly samples from both the HOI and HHI diffusion models, enabling the generation of interactions involving multiple humans.
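The idea of jointly sampling from two separately trained score models can be illustrated with a toy sketch. This is not the authors' implementation: the learned HOI and HHI diffusion models are replaced by analytic scores of 1-D Gaussians, and the joint sample is drawn by summing the two scores inside a Langevin loop, as in compositional score-based sampling. All function names here are hypothetical.

```python
import numpy as np

def hoi_score(x, mu=1.0, sigma=1.0):
    # Stand-in for the learned HOI model: score of N(mu, sigma^2),
    # i.e. d/dx log p(x) = -(x - mu) / sigma^2.
    return -(x - mu) / sigma**2

def hhi_score(x, mu=-1.0, sigma=1.0):
    # Stand-in for the learned HHI model.
    return -(x - mu) / sigma**2

def joint_langevin_sample(n_steps=2000, step=1e-2, seed=0):
    # Unadjusted Langevin dynamics on the *composed* score:
    # summing the two scores targets the (normalized) product of
    # the two densities, here proportional to N(0, 1/2).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal()
    for _ in range(n_steps):
        g = hoi_score(x) + hhi_score(x)  # composed score
        x = x + step * g + np.sqrt(2.0 * step) * rng.standard_normal()
    return x

samples = np.array([joint_langevin_sample(seed=s) for s in range(200)])
print(samples.mean(), samples.std())  # mean near 0, std near sqrt(1/2)
```

In the actual framework, each score function would be a conditional diffusion network evaluated on the relevant subset of humans and the object, and the combined update drives all poses toward configurations consistent with both models at once.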

Video Presentation

Results

Comparison with Competitors on HHOI Generation

HHOI Generation with Multiple Humans

Application: Human Motion In-Betweening

BibTeX

@inproceedings{hhoi,
  title={Learning to Generate Human-Human-Object Interactions from Textual Descriptions},
  author={Na, Jeonghyeon and Baik, Sangwon and Lee, Inhee and Lee, Junyoung and Joo, Hanbyul},
  booktitle={NeurIPS},
  year={2025}
}