1Huazhong University of Science and Technology   2Zhejiang University   3Youtu Lab, Tencent   4Huazhong Agricultural University  

Highlights

Compared to existing human-centric video benchmarks, our work offers the following key features:
(1) Diverse evaluation dimensions: HV-MMBench encompasses 14 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities;
(2) Varied data types: The benchmark includes multiple-choice, cloze, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance;
(3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations;
(4) Temporal coverage: The benchmark covers videos ranging from short (10 seconds) to long (up to 30 minutes) durations, supporting systematic analysis of models' temporal reasoning abilities across diverse contextual lengths.
We evaluate several advanced open-source MLLMs on HV-MMBench. While these models excel at closed-form tasks, their performance drops sharply in open-ended generation, revealing a reliance on shallow patterns rather than genuine reasoning; the cloze and open-ended formats thus better expose the reasoning challenges of human behavior understanding. By spanning diverse tasks and question paradigms, HV-MMBench systematically reveals these limitations and facilitates MLLM development.
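As a rough illustration of why scoring strictness differs across formats, the sketch below computes multiple-choice/true-false accuracy by exact option match and fill-in-the-blank precision/recall/F1 by token overlap. The function names, tokenization, and normalization are our own assumptions for exposition, not the benchmark's released evaluation code.

def mc_accuracy(preds, golds):
    # Multiple-choice / true-false: exact match on the selected option.
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds))
    return correct / max(len(golds), 1)

def fib_prf(pred, gold):
    # Fill-in-the-blank: token-level precision / recall / F1 against the gold answer.
    pred_tokens = set(pred.lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / max(len(pred_tokens), 1)
    recall = overlap / max(len(gold_tokens), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(mc_accuracy(["A", "C"], ["A", "B"]))                  # 0.5
print(fib_prf("waving at a friend", "waving to a friend"))  # (0.75, 0.75, 0.75)

Running the snippet shows the qualitative difference: a multiple-choice answer is simply right or wrong, whereas a near-miss free-form answer only earns partial credit.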

[Figure: HV-MMBench dataset statistics]
[Table: comparison with existing video benchmarks; caption below]

Comparison of prominent benchmarks in the video domain. We highlight key attributes, including domain scope (Open vs. Human), the number of videos (#Videos), the number of question-answer instances (#QA Ins.), supported tasks, QA formats, evaluation metrics, and resolution (Res). MC, FIB, TF, and OEQ abbreviate multiple-choice, fill-in-the-blank, true/false, and open-ended questions, respectively.

Leaderboard

Model performance on HV-MMBench under Multiple-Choice (MC), True/False (TF), Fill-in-the-Blank (FIB), and Open-Ended (OE) questions.

| Model | Organization | LLM Params | Date | MC Acc (%) | TF Acc (%) | FIB Precision (%) | FIB Recall (%) | FIB F1 (%) | OE ScoreF | OE ScoreO | OE ScoreG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA2 | DAMO Academy & Alibaba Group | 7B | 2025-05-22 | 81.7 | 80.4 | 4.31 | 0.55 | 0.97 | 0.19 | 0.35 | 0.56 |
| LLaVA-Video | ByteDance & NTU S-Lab & BUPT | 7B | 2025-05-22 | 88.6 | 81.4 | 17.5 | 2.32 | 4.01 | 0.14 | 0.24 | 0.49 |
| Qwen2-VL | Alibaba Group | 7B | 2025-05-22 | 84.3 | 84.1 | 13.0 | 1.64 | 2.81 | 0.15 | 0.33 | 0.56 |
| Qwen2.5-VL | Alibaba Group | 7B | 2025-05-22 | 86.8 | 88.3 | 16.8 | 2.21 | 3.82 | 0.22 | 0.47 | 0.64 |
| Qwen2.5-VL | Alibaba Group | 32B | 2025-05-22 | 86.8 | 89.9 | 19.7 | 2.48 | 4.33 | 0.19 | 0.51 | 0.69 |
| InternVL2.5 | Shanghai AI Laboratory | 8B | 2025-05-22 | 85.3 | 79.5 | 6.45 | 0.85 | 1.49 | 0.15 | 0.30 | 0.57 |
| InternVL2.5 | Shanghai AI Laboratory | 38B | 2025-05-22 | - | - | 6.22 | 0.85 | 1.45 | 0.17 | 0.37 | 0.53 |
| LLaVA-OneVision | ByteDance | 7B | 2025-05-22 | 91.1 | 84.9 | 11.7 | 1.52 | 2.66 | - | - | - |
Acc: accuracy. ScoreF: fuzzy step-wise F1 score measuring the overlap between predicted and ground-truth causal events. ScoreO: structural consistency, computed via the Longest Common Subsequence (LCS) between the predicted and reference causal chains. ScoreG: overall causal plausibility.
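To make the ScoreO definition concrete, here is a minimal sketch of an LCS-based structural-consistency score, assuming causal chains are ordered lists of event strings and that exact string matching stands in for the event-matching and normalization choices, which are not specified on this page; it is illustrative, not the official evaluation code.

def lcs_length(pred, ref):
    # Classic dynamic-programming Longest Common Subsequence length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == r else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(ref)]

def score_o(pred_chain, ref_chain):
    # Structural consistency: LCS length normalized by the reference chain length.
    # (Normalizing by the reference length is an assumption made for this sketch.)
    return lcs_length(pred_chain, ref_chain) / len(ref_chain) if ref_chain else 0.0

ref = ["person slips", "drops phone", "screen cracks"]
pred = ["person slips", "screen cracks"]
print(round(score_o(pred, ref), 2))  # 0.67: two of three reference events appear in order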

🚨 To submit your results to the leaderboard, please send model responses to this email.

Benchmark

Benchmark Overview


Overview of HV-MMBench, which spans diverse human-centric scenarios (50+ domains, with videos from 10 seconds to 30 minutes) and covers both basic perception and advanced reasoning tasks. It supports Multiple-Choice (MC), Fill-in-the-Blank (FIB), True/False (TF), and Open-Ended Questions (OEQ) to comprehensively evaluate MLLMs' understanding and cognitive capabilities.

Benchmark Construction


HV-MMBench construction pipeline. The benchmark is built through a three-stage pipeline: (1) large-scale Video Collection across diverse human-centric domains; (2) Automated QA Annotation via MLLMs and structured templates; and (3) a two-tier Quality Review combining automatic filtering and expert verification to ensure annotation reliability.
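For readers who want a mental model of the pipeline, the outline below mirrors the three stages in code; every function name, data field, and filtering rule here is a placeholder assumption for exposition, not the authors' released tooling.

from dataclasses import dataclass

@dataclass
class QAInstance:
    video_id: str
    question: str
    answer: str
    qa_format: str  # "MC", "FIB", "TF", or "OEQ"

def collect_videos(domains):
    # Stage 1: gather candidate human-centric clips per domain (placeholder source).
    return [f"{d}_clip_{i}" for d in domains for i in range(2)]

def annotate_with_mllm(video_id):
    # Stage 2: draft QA pairs with an MLLM plus structured templates (stubbed here).
    return [QAInstance(video_id, "What is the person doing?", "waving", "OEQ")]

def expert_approves(qa):
    # Stand-in for human verification; in practice this is manual review, not code.
    return True

def quality_review(instances):
    # Stage 3: two-tier review -- automatic filtering, then expert verification.
    auto_kept = [qa for qa in instances if qa.question.strip() and qa.answer.strip()]
    return [qa for qa in auto_kept if expert_approves(qa)]

videos = collect_videos(["sports", "daily_life"])
qa_pool = [qa for v in videos for qa in annotate_with_mllm(v)]
benchmark = quality_review(qa_pool)
print(len(benchmark), "QA instances retained")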

Experiments


Performance of different open-source MLLMs on HV-MMBench under Multiple-Choice and True/False questions, respectively.


Performance of MLLMs on HV-MMBench under fill-in-the-blank questions.


Performance of MLLMs on HV-MMBench under open-ended questions.

BibTeX


      @misc{cai2025hvmmbenchbenchmarkingmllmshumancentric,
            title={HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding}, 
            author={Yuxuan Cai and Jiangning Zhang and Zhenye Gan and Qingdong He and Xiaobin Hu and Junwei Zhu and Yabiao Wang and Chengjie Wang and Zhucun Xue and Xinwei He and Xiang Bai},
            year={2025},
            url={https://arxiv.org/abs/2507.04909}, 
      }