1Huazhong University of Science and Technology   2Zhejiang University   3Youtu Lab, Tencent   4Huazhong Agricultural University  

Highlights

Compared to existing human-centric video benchmarks, our work offers the following key features:
(1) Diverse evaluation dimensions: HV-MMBench encompasses 14 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities;
(2) Varied data types: The benchmark includes multiple-choice, cloze, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance;
(3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations;
(4) Temporal coverage: The benchmark covers videos ranging from short (10 seconds) to long (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
We evaluate several advanced open-source MLLMs on HV-MMBench. While these models excel at closed-form tasks, their performance drops sharply in open-ended generation, revealing a reliance on shallow patterns rather than genuine reasoning; the cloze and open-ended formats thus better expose the reasoning challenges of human behavior understanding. By spanning diverse tasks and question paradigms, HV-MMBench systematically reveals these limitations and facilitates MLLM development.
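As a rough illustration of why scoring strictness differs across formats, the sketch below computes multiple-choice/true-false accuracy by exact option match and fill-in-the-blank precision/recall/F1 by token overlap. The function names, tokenization, and normalization are our own assumptions for exposition, not the benchmark's released evaluation code.

def mc_accuracy(preds, golds):
    # Multiple-choice / true-false: exact match on the selected option.
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds))
    return correct / max(len(golds), 1)

def fib_prf(pred, gold):
    # Fill-in-the-blank: token-level precision / recall / F1 against the gold answer.
    pred_tokens = set(pred.lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / max(len(pred_tokens), 1)
    recall = overlap / max(len(gold_tokens), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(mc_accuracy(["A", "C"], ["A", "B"]))                  # 0.5
print(fib_prf("waving at a friend", "waving to a friend"))  # (0.75, 0.75, 0.75)

Running the snippet shows the qualitative difference: a multiple-choice answer is simply right or wrong, whereas a near-miss free-form answer only earns partial credit.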

[Figure: HV-MMBench dataset statistics]
[Table: comparison with existing video benchmarks; caption below]

Comparison of prominent benchmarks in the video domain. We highlight key attributes, including domain scope (Open vs. Human), the number of videos (#Videos), the number of question-answer instances (#QA Ins.), supported tasks, QA formats, evaluation metrics, and resolution (Res). MC, FIB, TF, and OEQ abbreviate multiple-choice, fill-in-the-blank, true/false, and open-ended questions, respectively.

Leaderboard

Model performance on HV-MMBench under Multiple-Choice (MC), True/False (TF), Fill-in-the-Blank (FIB), and Open-Ended (OE) questions.

| Model | Organization | LLM Params | Date | MC Acc (%) | TF Acc (%) | FIB Precision (%) | FIB Recall (%) | FIB F1 (%) | OE ScoreF | OE ScoreO | OE ScoreG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA2 | DAMO Academy & Alibaba Group | 7B | 2025-05-22 | 81.7 | 80.4 | 4.31 | 0.55 | 0.97 | 0.19 | 0.35 | 0.56 |
| LLaVA-Video | ByteDance & NTU S-Lab & BUPT | 7B | 2025-05-22 | 88.6 | 81.4 | 17.5 | 2.32 | 4.01 | 0.14 | 0.24 | 0.49 |
| Qwen2-VL | Alibaba Group | 7B | 2025-05-22 | 84.3 | 84.1 | 13.0 | 1.64 | 2.81 | 0.15 | 0.33 | 0.56 |
| Qwen2.5-VL | Alibaba Group | 7B | 2025-05-22 | 86.8 | 88.3 | 16.8 | 2.21 | 3.82 | 0.22 | 0.47 | 0.64 |
| Qwen2.5-VL | Alibaba Group | 32B | 2025-05-22 | 86.8 | 89.9 | 19.7 | 2.48 | 4.33 | 0.19 | 0.51 | 0.69 |
| InternVL2.5 | Shanghai AI Laboratory | 8B | 2025-05-22 | 85.3 | 79.5 | 6.45 | 0.85 | 1.49 | 0.15 | 0.30 | 0.57 |
| InternVL2.5 | Shanghai AI Laboratory | 38B | 2025-05-22 | - | - | 6.22 | 0.85 | 1.45 | 0.17 | 0.37 | 0.53 |
| LLaVA-OneVision | ByteDance | 7B | 2025-05-22 | 91.1 | 84.9 | 11.7 | 1.52 | 2.66 | - | - | - |
Acc: accuracy. ScoreF: fuzzy step-wise F1 score measuring the overlap between predicted and ground-truth causal events. ScoreO: structural consistency, computed via the Longest Common Subsequence (LCS) between the predicted and reference causal chains. ScoreG: overall causal plausibility.
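To make the ScoreO definition concrete, here is a minimal sketch of an LCS-based structural-consistency score, assuming causal chains are ordered lists of event strings and that exact string matching stands in for the event-matching and normalization choices, which are not specified on this page; it is illustrative, not the official evaluation code.

def lcs_length(pred, ref):
    # Classic dynamic-programming Longest Common Subsequence length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(ref)]

def score_o(pred_chain, ref_chain):
    # Structural consistency: LCS length normalized by the reference chain length.
    # (Normalizing by the reference length is an assumption made for this sketch.)
    return lcs_length(pred_chain, ref_chain) / len(ref_chain) if ref_chain else 0.0

ref = ["person slips", "drops phone", "screen cracks"]
pred = ["person slips", "screen cracks"]
print(round(score_o(pred, ref), 2))  # 0.67: two of three reference events appear in order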

🚨 To submit your results to the leaderboard, please send model responses to this email.

Benchmark

Benchmark Overview


Overview of HV-MMBench, which spans diverse human-centric scenarios (50+ domains, with videos from 10 seconds to 30 minutes) and covers both basic perception and advanced reasoning tasks. It supports Multiple-Choice (MC), Fill-in-the-Blank (FIB), True/False (TF), and Open-Ended Questions (OEQ) to comprehensively evaluate MLLMs' understanding and cognitive capabilities.

Benchmark Construction


HV-MMBench construction pipeline. The benchmark is built through a three-stage pipeline: (1) large-scale Video Collection across diverse human-centric domains; (2) Automated QA Annotation via MLLMs and structured templates; and (3) a two-tier Quality Review combining automatic filtering and expert verification to ensure annotation reliability.
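For readers who want a mental model of the pipeline, the outline below mirrors the three stages in code; every function name, data field, and filtering rule here is a placeholder assumption for exposition, not the authors' released tooling.

from dataclasses import dataclass

@dataclass
class QAInstance:
    video_id: str
    question: str
    answer: str
    qa_format: str  # "MC", "FIB", "TF", or "OEQ"

def collect_videos(domains):
    # Stage 1: gather candidate human-centric clips per domain (placeholder source).
    return [f"{d}_clip_{i}" for d in domains for i in range(2)]

def annotate_with_mllm(video_id):
    # Stage 2: draft QA pairs with an MLLM plus structured templates (stubbed here).
    return [QAInstance(video_id, "What is the person doing?", "waving", "OEQ")]

def expert_approves(qa):
    # Stand-in for human verification; in practice this is manual review, not code.
    return True

def quality_review(instances):
    # Stage 3: two-tier review -- automatic filtering, then expert verification.
    auto_kept = [qa for qa in instances if qa.question.strip() and qa.answer.strip()]
    return [qa for qa in auto_kept if expert_approves(qa)]

videos = collect_videos(["sports", "daily_life"])
qa_pool = [qa for v in videos for qa in annotate_with_mllm(v)]
benchmark = quality_review(qa_pool)
print(len(benchmark), "QA instances retained")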

Experiments


Performance of different open-source MLLMs on HV-MMBench under Multiple-Choice and True/False questions, respectively.


Performance of MLLMs on HV-MMBench under fill-in-the-blank questions.


Performance of MLLMs on HV-MMBench under open-ended questions.

BibTeX


      @misc{cai2025hvmmbenchbenchmarkingmllmshumancentric,
            title={HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding}, 
            author={Yuxuan Cai and Jiangning Zhang and Zhenye Gan and Qingdong He and Xiaobin Hu and Junwei Zhu and Yabiao Wang and Chengjie Wang and Zhucun Xue and Xinwei He and Xiang Bai},
            year={2025},
            url={https://arxiv.org/abs/2507.04909}, 
      }