1Department of Computer Science, Rutgers University–New Brunswick
2The Hong Kong University of Science and Technology (GZ)
3Shanghai AI Laboratory
*Equal contribution. †Corresponding author.
Corresponding Author:
Jingjin Yu (jingjin.yu@cs.rutgers.edu)
Project Leaders:
Litao Liu (litao.liu@rutgers.edu),
Yifan Han (hanyifan2024@ia.ac.cn)
This work was completed by Yifan Han, Pengfei Yi, and Wenbo Yu during their internship at Rutgers University–New Brunswick.
Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction–region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision readily enables training a real-time grounding model (A2A-GroundingModel) and integrating its predictions into task-conditioned manipulation policies (A2A-Policy). Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and downstream manipulation.
A scene-level, task-conditioned, one-to-many benchmark for robot-oriented part affordance grounding, associating manipulation intents with multiple actionable functional regions in real-world scenes.
An agent-assisted data construction pipeline that scales multi-object affordance annotation via language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification.
A2A-GroundingModel adapts SAM3 for real-time task-conditioned part grounding from image–instruction pairs, while A2A-Policy incorporates the predicted masks as structured visual priors for manipulation.
Real-time task-conditioned affordance grounding from natural-language instructions.
At the policy level, A2A-Policy integrates the masks predicted by A2A-GroundingModel into a language-conditioned manipulation policy as structured visual–action priors. Rather than forcing the policy to infer task-relevant regions implicitly from raw observations and demonstrations, the grounded functional regions explicitly direct the policy toward actionable parts of the scene. This converts A2A-Bench supervision into policy-ready guidance, reducing the data burden, improving interpretability, and supporting real-time deployment on a real robot arm.
A2A-Policy deployed on a real robot arm across everyday manipulation tasks.
Start–to–end rollouts comparing A2A-Implicit, A2A-Explicit, and UAD-DP on each task.




@article{liu2026affordance2action,
title = {Affordance2Action: Task-Conditioned Scene-level Affordance
Grounding for Real-Time Manipulation},
author = {Liu, Litao and Han, Yifan and Yi, Pengfei and Yu, Wenbo and
Wang, Hanqing and Du, Haoran and Yuan, Enze and Yuan, Zilin and
Feng, Ruiding and Liu, Michael and others},
journal = {arXiv preprint arXiv:2606.04172},
year = {2026}
}