Illustration of CFG-Bench's focus on embodied intelligence over descriptive accuracy. The top part shows how FAVOR-Bench annotates and asks questions from a third-person perspective, a task that current MLLMs can often solve. In contrast, the bottom part demonstrates CFG-Bench's fine-grained annotation and first-person scenario questions, which probe for the actionable physical and intentional details necessary for embodied agents. Current MLLMs struggle to master the crucial fine-grained details required for physical interaction.
Task demonstration of CFG-Bench. Note: all QA pairs, including those above, are slightly simplified for clarity and brevity.
Data statistics of CFG-Bench. (a) Distribution and video length statistics of the five datasets. (b) The distribution of tasks across the four tiers. AW denotes the average number of words per question.
Comparison with other benchmarks. CFG-Bench introduces a four-tiered cognitive framework for embodied fine-grained intelligence, distinguishing it from existing action benchmarks that focus on coarse third-person description and from other embodied benchmarks that prioritize spatial reasoning and high-level planning over fine-grained physical action. It also provides a more comprehensive evaluation protocol by uniquely integrating QAs in three modalities. ✓ indicates partial coverage.
Comprehensive results of leading MLLMs on CFG-Bench. The table reports the performance on each task within every cognitive tier, along with the average score for open-ended questions (Avg_o) and the average accuracy for close-ended questions (Avg_c). Random selection and human performance are also included for comparison. The best and second-best results are shown in bold and underlined, respectively.
The robotic arm approaches the glass and the bottle.
The robotic arm picks up the bottle.
The robotic arm tilts the bottle and pours the liquid into the glass.
The robotic hand is positioned on the left side of the frame. It slowly moves toward the glass bottle on the table. The robotic hand then grips the bottle's handle with its fingers, lifting it slightly off the table. As the bottle is lifted, the robotic hand rotates it counterclockwise to align the spout with the glass below. The hand then tilts the bottle and pours the liquid into the glass.
Person opens a bottle of vanilla extract with their right hand.
Person scoops vanilla extract with a spoon using their right hand.
Person pours the vanilla extract into a cup using their right hand.
The fingers of both hands are opening a bag placed on the table. During the process, they slowly lift and rotate the bag once. Then, the left hand holds the bag while the right hand uses its fingers to open the bag's mouth. The right hand takes a spoon from the cup on the table, lifts it up, inserts it into the bag, slowly scoops out some black powder, and pours the powder into the cup.
@article{liu2025beyond,
title={Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents},
author={Liu, Dayong and Xu, Chao and Chen, Weihong and Zhang, Suyu and Wang, Juncheng and Deng, Jiankang and Sun, Baigui and Liu, Yang},
journal={arXiv preprint arXiv:2511.18685},
year={2025}
}