Affordance2Action (A2A)

Overview of Affordance2Action (A2A). Our benchmark-centered learning framework first builds A2A-Bench through A2A-AffordGen, scaling one-to-many affordance annotations in natural multi-object scenes. A2A-Bench's supervision drives A2A-GroundingModel and A2A-Policy for real-time affordance grounding and robot manipulation.

Abstract

Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction–region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision readily enables training a real-time grounding model (A2A-GroundingModel) and integrating its predictions into task-conditioned manipulation policies (A2A-Policy). Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and downstream manipulation.

Contributions

A2A-Bench

A scene-level, task-conditioned, one-to-many benchmark for robot-oriented part affordance grounding, associating manipulation intents with multiple actionable functional regions in real-world scenes.
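To make the one-to-many setting concrete, the sketch below shows one way a sample with several valid regions could be represented and scored. It is an illustration under stated assumptions, not the A2A-Bench format or metric: the `AffordanceSample` record, its field names, the pixel-set mask encoding, and the best-match IoU rule are all hypothetical and do not come from the benchmark itself.

```python
from dataclasses import dataclass, field

Pixel = tuple[int, int]  # (row, col) coordinate of a mask pixel


@dataclass
class AffordanceSample:
    """Hypothetical record: one instruction paired with one or more valid regions."""
    image_id: str
    instruction: str
    regions: list[frozenset[Pixel]] = field(default_factory=list)


def iou(a, b):
    """Intersection-over-union of two masks stored as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_match_iou(pred, sample):
    """Score a predicted mask against the closest valid region.

    Assumed rule: a prediction counts as correct if it matches ANY of the
    annotated regions, so the maximum IoU over all regions is taken.
    """
    return max((iou(pred, region) for region in sample.regions), default=0.0)
```

For a scene with two mugs, for example, a sample for "Hold the mug" would carry two handle regions, and a prediction covering either handle would score well under this rule. A stricter protocol could instead require that every valid region be recovered.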

A2A-AffordGen

An agent-assisted data construction pipeline that scales multi-object affordance annotation via language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification.

A2A-GroundingModel & A2A-Policy

A2A-GroundingModel adapts SAM3 for real-time task-conditioned part grounding from image–instruction pairs, while A2A-Policy incorporates the predicted masks as structured visual priors for manipulation.

A2A-AffordGen: Agent-Assisted Annotation

A2A-AffordGen annotation pipeline. Starting from large-scale natural images, language-model filtering identifies manipulation-relevant tasks, then interactive part segmentation and an iterative mask-out strategy annotate *multiple* valid affordance regions per instruction.

Refining a SAM3 mask into precise affordance regions.

Annotating multiple masks from scratch in a multi-object scene.
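The iterative mask-out idea can be sketched as a simple loop: accept one region, hide it, and query again until nothing new is proposed. This is a minimal sketch under assumptions, not the A2A-AffordGen implementation. The `propose_region` callback stands in for the interactive segmentation step and is hypothetical, masks are stored as pixel sets for simplicity, and the human-verification stage is omitted.

```python
def iterative_mask_out(image_pixels, propose_region, max_regions=10):
    """Collect multiple regions for one instruction by masking out accepted ones.

    image_pixels   : iterable of (row, col) pixels in the image
    propose_region : hypothetical callback; takes the set of still-visible
                     pixels and returns a set of pixels, or None when it
                     finds no further region
    max_regions    : safety cap on the number of iterations
    """
    visible = set(image_pixels)
    regions = []
    for _ in range(max_regions):
        proposal = propose_region(visible)
        if not proposal:
            break
        region = set(proposal) & visible  # ignore pixels already masked out
        if not region:
            break
        regions.append(frozenset(region))
        visible -= region  # hide this region before the next query
    return regions


def make_toy_proposer(candidates):
    """Toy stand-in for a segmenter: returns the first fully visible candidate."""
    def propose(visible):
        for candidate in candidates:
            if candidate <= visible:
                return candidate
        return None
    return propose
```

With two candidate handle regions, the loop returns the first, masks it out, and then returns the second. The cap `max_regions` only guards against a proposer that never stops returning regions.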

A2A-GroundingModel: Real-Time Affordance Grounding

A2A grounding model architecture. We adapt SAM3 into a task-conditioned affordance part segmentor that predicts masks directly from image–instruction pairs, with no point or box prompts at inference. Staged instruction adaptation with text-conditioned visual prompt injection bridges explicit part descriptions and implicit task intent while preserving SAM3's zero-shot priors.

Real-time task-conditioned affordance grounding from natural-language instructions.

“Hold the mug”

“Open the bottle” → “Hold the knife” (live task switch)

“Sit on the chair”

A2A-Policy: Affordance-Guided Manipulation

At the policy level, A2A-Policy integrates the masks predicted by A2A-GroundingModel into a language-conditioned manipulation policy as structured visual–action priors. Rather than forcing the policy to infer task-relevant regions implicitly from raw observations and demonstrations, the grounded functional regions explicitly direct the policy toward actionable parts of the scene. This converts A2A-Bench supervision into policy-ready guidance, reducing the data burden, improving interpretability, and supporting real-time deployment on a real robot arm.
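One common way to pass a predicted mask to a policy as an explicit prior is to append it to the observation as an extra image channel. The sketch below illustrates that idea only. How A2A-Policy actually fuses masks is not specified here, so the function `add_mask_channel` and the nested-list image layout are assumptions made for illustration.

```python
def add_mask_channel(rgb, mask):
    """Append a binary affordance mask to an RGB image as a fourth channel.

    rgb  : H x W x 3 nested lists of pixel values
    mask : H x W nested lists, where 1 marks the grounded functional region
    Returns an H x W x 4 nested list; the input image is not modified.
    """
    if len(rgb) != len(mask) or any(len(r) != len(m) for r, m in zip(rgb, mask)):
        raise ValueError("image and mask must have the same height and width")
    return [
        [list(pixel) + [float(m)] for pixel, m in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb, mask)
    ]
```

A policy consuming the four-channel observation sees the grounded region directly at every step, rather than having to infer it from raw pixels and demonstrations. Alternatives such as cropping to the mask or using it to weight visual features would fit the same description.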

Real-World Manipulation

A2A-Policy deployed on a real robot arm across everyday manipulation tasks.

Open the microwave oven

Place the blue cube on the wooden cube

Place the phone on the stand

Place the cup on the mat

Start-to-end rollouts comparing A2A-Implicit, A2A-Explicit, and UAD-DP on each task.

Open the microwave oven – rollout comparison

Stack the blue cube – rollout comparison

Place the phone on the stand – rollout comparison

Place the cup on the mat – rollout comparison

Comparison with Baselines

BibTeX

@article{liu2026affordance2action,
  title   = {Affordance2Action: Task-Conditioned Scene-level Affordance
             Grounding for Real-Time Manipulation},
  author  = {Liu, Litao and Han, Yifan and Yi, Pengfei and Yu, Wenbo and
             Wang, Hanqing and Du, Haoran and Yuan, Enze and Yuan, Zilin and
             Feng, Ruiding and Liu, Michael and others},
  journal = {arXiv preprint arXiv:2606.04172},
  year    = {2026}
}