Affordance2Action (A2A)

Task-Conditioned Scene-level Affordance Grounding
for Real-Time Manipulation

Litao Liu1,* Yifan Han1,* Pengfei Yi1 Wenbo Yu1 Hanqing Wang2 Haoran Du1 Enze Yuan1 Zilin Yuan1 Ruiding Feng1 Michael Liu1 Qi Zhang3 Jingjin Yu1,†

1Department of Computer Science, Rutgers University–New Brunswick

2The Hong Kong University of Science and Technology (GZ)

3Shanghai AI Laboratory

*Equal contribution. †Corresponding author.

Corresponding Author: Jingjin Yu (jingjin.yu@cs.rutgers.edu)
Project Leaders: Litao Liu (litao.liu@rutgers.edu), Yifan Han (hanyifan2024@ia.ac.cn)

This work was completed by Yifan Han, Pengfei Yi, and Wenbo Yu during their internship at Rutgers University–New Brunswick.

Overview of Affordance2Action (A2A). Our benchmark-centered learning framework first builds A2A-Bench through A2A-AffordGen, scaling one-to-many affordance annotations in natural multi-object scenes. A2A-Bench's supervision drives A2A-GroundingModel and A2A-Policy for real-time affordance grounding and robot manipulation.

Abstract

Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction–region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision readily enables training a real-time grounding model (A2A-GroundingModel) and integrating its predictions into task-conditioned manipulation policies (A2A-Policy). Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and downstream manipulation.

Contributions

A2A-Bench

A scene-level, task-conditioned, one-to-many benchmark for robot-oriented part affordance grounding, associating manipulation intents with multiple actionable functional regions in real-world scenes.
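The one-to-many structure described above can be pictured as a record that ties one instruction to a list of valid regions. The following is a minimal sketch only: the class names, field names, and the pixel-set mask encoding are assumptions made for illustration, not the actual A2A-Bench schema.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]  # (row, col); hypothetical mask encoding


@dataclass
class AffordanceRegion:
    """One actionable functional region (illustrative fields)."""
    object_name: str
    part_name: str
    pixels: FrozenSet[Pixel]


@dataclass
class TaskAnnotation:
    """One instruction paired with every region that validly satisfies it."""
    image_id: str
    instruction: str
    regions: List[AffordanceRegion] = field(default_factory=list)

    @property
    def is_multi_region(self) -> bool:
        # Multi-region samples are the one-to-many case.
        return len(self.regions) > 1

    def union_mask(self) -> FrozenSet[Pixel]:
        # All pixels belonging to any valid region for this instruction.
        merged = set()
        for region in self.regions:
            merged |= region.pixels
        return frozenset(merged)


# Toy scene with two mugs: either handle satisfies the instruction.
example = TaskAnnotation(
    image_id="scene_0001",
    instruction="Hold the mug",
    regions=[
        AffordanceRegion("mug_left", "handle", frozenset({(0, 0), (0, 1)})),
        AffordanceRegion("mug_right", "handle", frozenset({(5, 5)})),
    ],
)
```

Keeping the regions as a list rather than a single merged mask preserves which object each region belongs to, which matters when a downstream consumer needs to pick one target among several valid ones.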

A2A-AffordGen

An agent-assisted data construction pipeline that scales multi-object affordance annotation via language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification.

A2A-GroundingModel & A2A-Policy

A2A-GroundingModel adapts SAM3 for real-time task-conditioned part grounding from image–instruction pairs, while A2A-Policy incorporates the predicted masks as structured visual priors for manipulation.

A2A-AffordGen: Agent-Assisted Annotation

A2A-AffordGen annotation pipeline
Starting from large-scale natural images, language-model filtering identifies manipulation-relevant tasks, then interactive part segmentation and an iterative mask-out strategy annotate multiple valid affordance regions per instruction.
Refining a SAM3 mask into precise affordance regions.
Annotating multiple masks from scratch in a multi-object scene.
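The iterative mask-out idea can be sketched as a loop that hides each region once it is found, so the next segmentation pass has to find a different one. This is a sketch under stated assumptions: `segment_next` is a hypothetical stand-in for the interactive part segmentation step, masks are plain pixel sets, and the real pipeline's stopping rule and human verification are not modeled.

```python
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def iterative_mask_out(
    image_pixels: Mask,
    segment_next: Callable[[Mask], Optional[Mask]],
    max_regions: int = 10,
) -> List[Mask]:
    """Collect several regions for one instruction by masking out each hit."""
    visible = set(image_pixels)
    regions: List[Mask] = []
    for _ in range(max_regions):
        candidate = segment_next(visible)
        if not candidate:
            break  # segmenter found nothing more
        candidate = candidate & visible  # ignore pixels already masked out
        if not candidate:
            break
        regions.append(candidate)
        visible -= candidate  # hide this region before the next pass
    return regions


def make_toy_segmenter(known_regions: List[Mask]) -> Callable[[Mask], Optional[Mask]]:
    """Hypothetical segmenter: returns the first known region still fully visible."""
    def segment_next(visible: Mask) -> Optional[Mask]:
        for region in known_regions:
            if region <= visible:
                return set(region)
        return None
    return segment_next


image = {(r, c) for r in range(4) for c in range(4)}
handles = [{(0, 0), (0, 1)}, {(3, 2), (3, 3)}]
found = iterative_mask_out(image, make_toy_segmenter(handles))
```

In this toy run the loop returns both handle regions and then stops because nothing is left to segment. In practice each returned region would still go through human verification, as the pipeline description states.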

A2A-GroundingModel: Real-Time Affordance Grounding

A2A grounding model architecture
We adapt SAM3 into a task-conditioned affordance part segmentor that predicts masks directly from image–instruction pairs, with no point or box prompts at inference. Staged instruction adaptation with text-conditioned visual prompt injection bridges explicit part descriptions and implicit task intent while preserving SAM3's zero-shot priors.

Real-time task-conditioned affordance grounding from natural-language instructions.

“Hold the mug”
“Open the bottle” → “Hold the knife” (live task switch)
“Sit on the chair”
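The interface implied above takes an image and an instruction and returns a mask, with no point or box prompts, and the instruction may change between frames. A minimal sketch of that calling pattern follows. `ground_stream`, `toy_grounder`, and the label-map frame format are hypothetical placeholders and do not represent the A2A-GroundingModel API.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = List[List[int]]  # toy frame: a 2D map of part ids (assumption)
Mask = List[List[int]]   # binary mask with the same height and width
Grounder = Callable[[Frame, str], Mask]


def ground_stream(
    frames_with_instructions: Iterable[Tuple[Frame, str]],
    grounder: Grounder,
) -> Iterator[Tuple[str, Mask]]:
    """Ground every frame against whichever instruction is active for it."""
    for frame, instruction in frames_with_instructions:
        # Only the image and the text are passed; no spatial prompts.
        yield instruction, grounder(frame, instruction)


# Hypothetical lookup standing in for a learned model.
TOY_PART_IDS = {"Hold the mug": 1, "Hold the knife": 2}


def toy_grounder(frame: Frame, instruction: str) -> Mask:
    target = TOY_PART_IDS.get(instruction, -1)
    return [[1 if value == target else 0 for value in row] for row in frame]


frame = [[1, 0], [0, 2]]
stream = [(frame, "Hold the mug"), (frame, "Hold the knife")]  # live task switch
results = list(ground_stream(stream, toy_grounder))
```

Because the instruction travels with each frame, a task switch needs no re-initialization in this sketch: the next frame is simply grounded against the new text.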

A2A-Policy: Affordance-Guided Manipulation

At the policy level, A2A-Policy integrates the masks predicted by A2A-GroundingModel into a language-conditioned manipulation policy as structured visual–action priors. Rather than forcing the policy to infer task-relevant regions implicitly from raw observations and demonstrations, the grounded functional regions explicitly direct the policy toward actionable parts of the scene. This converts A2A-Bench supervision into policy-ready guidance, reducing the data burden, improving interpretability, and supporting real-time deployment on a real robot arm.
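One common way to hand a predicted mask to a policy as a structured visual prior is to append it to the observation as an extra channel, optionally with a coarse target point. The sketch below illustrates that general idea only. This page does not specify how A2A-Policy injects the masks, so the function names and the nested-list image format are assumptions.

```python
from typing import List, Optional, Tuple

Image = List[List[List[float]]]  # H x W x C, nested lists for illustration
Mask = List[List[int]]           # H x W, values 0 or 1


def append_mask_channel(rgb: Image, mask: Mask) -> Image:
    """Return an H x W x (C + 1) observation with the mask as the last channel."""
    if len(rgb) != len(mask) or any(len(r) != len(m) for r, m in zip(rgb, mask)):
        raise ValueError("image and mask must share height and width")
    return [
        [pixel + [float(flag)] for pixel, flag in zip(img_row, mask_row)]
        for img_row, mask_row in zip(rgb, mask)
    ]


def mask_centroid(mask: Mask) -> Optional[Tuple[float, float]]:
    """Mean (row, col) of the mask pixels, or None if the mask is empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    n = len(coords)
    return (sum(r for r, _ in coords) / n, sum(c for _, c in coords) / n)


rgb = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [0.0, 0.0, 0.0]],
]
mask = [[0, 1], [0, 1]]
observation = append_mask_channel(rgb, mask)
target = mask_centroid(mask)
```

An explicit channel keeps the grounded region visible in the policy input, which is also convenient for inspecting what the policy was told to attend to.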

Real-World Manipulation

A2A-Policy deployed on a real robot arm across everyday manipulation tasks.

Open the microwave oven
Place the blue cube on the wooden cube
Place the phone on the stand
Place the cup on the mat

Start-to-end rollouts comparing A2A-Implicit, A2A-Explicit, and UAD-DP on each task.

Open the microwave oven – rollout comparison
Stack the blue cube – rollout comparison
Place the phone on the stand – rollout comparison
Place the cup on the mat – rollout comparison

Comparison with Baselines

Qualitative comparison with baselines
A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization.

BibTeX

@article{liu2026affordance2action,
  title   = {Affordance2Action: Task-Conditioned Scene-level Affordance
             Grounding for Real-Time Manipulation},
  author  = {Liu, Litao and Han, Yifan and Yi, Pengfei and Yu, Wenbo and
             Wang, Hanqing and Du, Haoran and Yuan, Enze and Yuan, Zilin and
             Feng, Ruiding and Liu, Michael and others},
  journal = {arXiv preprint arXiv:2606.04172},
  year    = {2026}
}