Mementos

A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

1 University of Maryland, College Park
2 UNC-Chapel Hill, Chapel Hill
*Indicates Equal Advising

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations.

Leaderboard

Recall, Precision, and F1 scores for Objects and Behaviors on the validation set of Mementos.

| # | Model | Input type | Source | Date | Avg | Object-Recall | Object-Precision | Object-F1 | Behavior-Recall | Behavior-Precision | Behavior-F1 |
|---|-------|------------|--------|------|-----|---------------|------------------|-----------|-----------------|--------------------|-------------|
| 1 | GPT-4V 🥇 | Sequential | Link | 2024-01-20 | 45.68 | 60.24 | 54.13 | 55.36 | 42.36 | 29.40 | 32.58 |
| 2 | Gemini 🥈 | Sequential | Link | 2024-01-20 | 33.98 | 38.36 | 43.12 | 38.91 | 26.28 | 31.01 | 26.18 |
| 3 | LLaVA-1.5 🥉 | Combined | Link | 2024-01-20 | 32.78 | 36.90 | 46.14 | 39.29 | 22.09 | 29.22 | 23.01 |
| 4 | Chat-UniVi | Sequential | Link | 2024-01-20 | 31.69 | 39.09 | 38.26 | 37.06 | 25.36 | 26.67 | 23.74 |
| 5 | Gemini | Combined | Link | 2024-01-20 | 30.44 | 33.28 | 39.47 | 34.42 | 26.76 | 25.38 | 23.33 |
| 6 | GPT-4V | Combined | Link | 2024-01-20 | 30.13 | 35.41 | 36.34 | 34.46 | 30.70 | 20.82 | 23.07 |
| 7 | mPLUG_Owl-v2 | Combined | Link | 2024-01-20 | 28.26 | 28.51 | 40.65 | 32.20 | 19.74 | 27.81 | 20.64 |
| 8 | InstructBLIP | Combined | Link | 2024-01-20 | 27.10 | 27.37 | 33.86 | 28.77 | 23.98 | 25.69 | 22.92 |
| 9 | Chat-UniVi | Combined | Link | 2024-01-20 | 25.67 | 30.14 | 32.24 | 29.86 | 20.32 | 21.97 | 19.52 |
| 10 | Video-LLaMA-2 | Sequential | Link | 2024-01-20 | 21.13 | 25.59 | 23.50 | 23.35 | 16.21 | 21.47 | 16.62 |
| 11 | MiniGPT4 | Combined | Link | 2024-01-20 | 18.73 | 25.33 | 17.95 | 20.01 | 16.02 | 17.82 | 15.26 |
| 12 | MiniGPT5 | Combined | Link | 2024-01-20 | 18.28 | 24.58 | 17.69 | 19.44 | 15.04 | 17.93 | 15.02 |

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to Evaluation.

Mementos Dataset

Overview

Mementos is a comprehensive benchmark designed to evaluate the reasoning capability of Multimodal Large Language Models (MLLMs) over image sequences. It includes 4,761 image sequences of varying lengths. The image sequences in Mementos are categorized into three domains: Daily-life, Robotics, and Comics. This diverse collection is crucial for evaluating the comprehensive time-varying reasoning abilities of MLLMs. Specifically, the robotics data, closely associated with embodied AI or real-world contexts, and the comic-style storyboard data, rich in stylistic and episodic diversity, significantly enhance the benchmark's relevance and robustness.

Dataset Statistics

All the data are divided into training and validation sets.

  • Training: 4,062 image sequences for MLLM training or fine-tuning.
  • Validation: 699 image sequences for evaluation.
You can download the dataset from the Dataset page.

GPT-4-assisted Evaluation

We employ a GPT-4-assisted evaluation procedure: after an MLLM produces a description for an image sequence, we use GPT-4 to extract object and behavior keywords from both the AI-generated and the human-annotated descriptions, and then use keyword matching between the two sets to quantify the degree of object and behavioral hallucination.
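As a rough illustration of the matching step, the Python sketch below computes the Recall, Precision, and F1 reported in the leaderboard from two keyword sets. The function name, exact-match rule, and example keywords are hypothetical and not taken from the official evaluation script.

def keyword_scores(predicted, reference):
    """Recall, precision, and F1 between two keyword sets (exact matching)."""
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    recall = len(matched) / len(reference) if reference else 0.0
    precision = len(matched) / len(predicted) if predicted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

# Hypothetical object keywords extracted by GPT-4 for one image sequence:
ai_objects = {"dog", "ball", "park"}                  # from the MLLM's description
human_objects = {"dog", "frisbee", "park", "owner"}   # from the human annotation
print(keyword_scores(ai_objects, human_objects))      # (0.50, 0.67, 0.57), approximately

The same computation is applied to behavior keywords; a lower score indicates a higher degree of hallucination in the corresponding category.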

Evaluation Results on Existing MLLMs

GPT-4V with sequential input (s-input) demonstrates the strongest reasoning capability over image sequences among all evaluated MLLMs. Among open-source models, LLaVA-1.5 performs best, nearly matching or even surpassing the black-box model Gemini in object comprehension, but its ability to infer behaviors from image sequences is weaker than that of Gemini and GPT-4V.

Citation

If you find our work useful, please consider citing the paper as follows:


@misc{wang2024mementos,
      title={Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences}, 
      author={Xiyao Wang and Yuhang Zhou and Xiaoyu Liu and Hongjin Lu and Yuancheng Xu and Feihong He and Jaehong Yoon and Taixi Lu and Gedas Bertasius and Mohit Bansal and Huaxiu Yao and Furong Huang},
      year={2024},
      eprint={2401.10529},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}