Contra4

Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D

1University of Pennsylvania,
2Salesforce Research
*Work done during internship at Salesforce.

Introduction

To achieve a deeper understanding of the world, AI must be able to reason across multiple modalities, such as images, audio, video, and 3D. While recent efforts have extended multimodal models to process multiple modalities, there is little evidence that they enable reasoning beyond two modalities simultaneously. This limitation arises partly from the challenge of constructing tasks that require reasoning across multiple modalities.

To address this, we introduce Contra4, a dataset designed to train and evaluate contrastive cross-modal reasoning over up to four modalities (audio, video, image, and 3D) simultaneously. Our approach unifies modalities through human-annotated captions and generates contrastive question-answer pairs, filtered via a mixture-of-models round-trip-consistency check.

Human inspection validates the high quality of Contra4, with 83.3% perceived correctness, while fine-tuning on the task results in a 56% relative accuracy improvement. Benchmarking state-of-the-art models on a human-annotated subset of 2.3k samples underscores the dataset's challenge: the best-performing model achieves only 56% accuracy overall and just 42% in four-modality settings.

Leaderboard

Accuracy scores on the human-annotated test set of Contra4.

| # | Model | Method | LLM Base | ALL | Random All | Random MC2 | Random MC3 | Random MC4 | Similarity All | Similarity MC2 | Similarity MC3 | Similarity MC4 |
|---|-------|--------|----------|-----|------------|------------|------------|------------|----------------|----------------|----------------|----------------|
| 1 | CREMA 🥇 | MLLM 🖼️ | FlanT5-xl | 0.56 | 0.60 | 0.71 | 0.61 | 0.45 | 0.53 | 0.64 | 0.55 | 0.39 |
| 3 | OneLLM-Finetuned | MLLM 🖼️ | LLaMA-2 7B-Finetuned | 0.50 | 0.54 | 0.60 | 0.43 | 0.58 | 0.47 | 0.60 | 0.36 | 0.43 |
| 2 | X-InstructBLIP | MLLM 🖼️ | Vicuna1.1 7B | 0.32 | 0.31 | 0.47 | 0.30 | 0.13 | 0.33 | 0.48 | 0.27 | 0.22 |
| 3 | OneLLM | MLLM 🖼️ | LLaMA-2 7B | 0.32 | 0.31 | 0.52 | 0.16 | 0.24 | 0.34 | 0.52 | 0.22 | 0.27 |
| 4 | Gemini-2.0* | MLLM 🖼️ | gemini-2.0-flash-exp | 0.22 | 0.23 | 0.24 | 0.10 | × | 0.20 | 0.21 | 0.14 | × |
| 5 | Predicted Caption | LLM 💬 | LLaMA-3.1 7B | 0.37 | 0.38 | 0.52 | 0.33 | 0.26 | 0.36 | 0.46 | 0.33 | 0.27 |

Method types: MLLM 🖼️: Cross-Modal model, LLM 💬: Large Language Model with Predicted Captions
MCX: Multiple choice with X options. Similarity: Negative sampling based on high similarity of captions (a brief sketch follows these notes). Random: Random negative sampling.
CREMA uses an additional RGB signal for 3D inputs.
* Gemini is evaluated only on examples that do not include 3D, since it does not yet support 3D input.
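
To make the "Similarity" setting concrete, below is a minimal sketch of caption-similarity-based distractor selection versus random sampling. The embedding model name is an assumption for illustration, not necessarily what the benchmark uses.

```python
import random
from sentence_transformers import SentenceTransformer, util

def similarity_negatives(target_caption: str, pool: list[str], k: int) -> list[int]:
    """Pick the k pool captions most similar to the target caption (hard negatives)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    target_emb = model.encode(target_caption, convert_to_tensor=True)
    pool_emb = model.encode(pool, convert_to_tensor=True)
    scores = util.cos_sim(target_emb, pool_emb)[0]
    return scores.topk(k).indices.tolist()

def random_negatives(pool: list[str], k: int) -> list[int]:
    """The 'Random' setting simply samples distractors uniformly."""
    return random.sample(range(len(pool)), k)
```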

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Contra4 Dataset

Overview

The Contra4 dataset introduced in this paper comprises 174k distinct training samples and 2.3k human-annotated test samples, designed to evaluate the cross-modal reasoning capabilities of large multimodal models. The inputs span up to four modalities (audio, video, image, and 3D) sourced from 5 different captioning datasets, with questions generated and answers verified through a mixture-of-models round-trip-consistency method. A human inspection of the data showed that our data generation pipeline achieves 83.3% accuracy in generating valid examples for fine-tuning.
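
For illustration only, the snippet below sketches what a single multiple-choice record could look like once the modalities are unified through their captions; every field name is a hypothetical placeholder, not the released schema.

```python
# Illustrative sketch only: hypothetical field names, not the released Contra4 schema.
example = {
    "options": [                                    # up to four candidate instances
        {"modality": "audio", "source_id": "..."},  # drawn from the source captioning datasets
        {"modality": "video", "source_id": "..."},
        {"modality": "image", "source_id": "..."},
        {"modality": "3d",    "source_id": "..."},
    ],
    "question": "Which of the inputs shows ...?",   # contrastive question over the options
    "answer_index": 2,                              # index of the correct instance
    "negative_sampling": "similarity",              # "random" or "similarity" distractors
}
```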

Data Generation & Finetuning

Mixture-of-Models Round-Trip-Consistency data generation pipeline for Contra4.
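
As a rough sketch of the idea behind this pipeline (not the released implementation): one model proposes a contrastive question over the human-annotated captions, and a set of independent answerer models must recover the intended answer from the captions alone before the sample is kept. All names below (`Candidate`, `answerers`, `predict_index`) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    captions: list[str]   # one human-annotated caption per candidate modality instance
    question: str         # generated contrastive question over those instances
    answer_idx: int       # index of the instance the question was written about

def round_trip_consistent(cand: Candidate, answerers, min_agree: int = 2) -> bool:
    """Keep a generated QA pair only if enough independent models, shown just the
    captions and the question, pick the intended option (round-trip consistency)."""
    prompt = (
        "Options:\n"
        + "\n".join(f"({i}) {c}" for i, c in enumerate(cand.captions))
        + f"\nQuestion: {cand.question}\nReply with the index of the best option."
    )
    votes = [model.predict_index(prompt) for model in answerers]  # hypothetical model API
    return sum(v == cand.answer_idx for v in votes) >= min_agree
```

Filtering with a mixture of answerer models, rather than a single one, reduces the chance that one model's idiosyncrasies leak into the kept samples.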

Visualization

Ethics Statement

In conducting this research, we acknowledge the significant limitations and potential dangers associated with the use of Large Language Models (LLMs). One of the primary concerns is the presence of inherent biases within LLMs, which are a direct consequence of the data on which they are trained. These biases can inadvertently perpetuate harmful stereotypes and lead to discriminatory outcomes, particularly in sensitive applications. Additionally, LLMs, especially those with large parameter counts, may generate outputs that are factually incorrect or misleading, posing a risk in contexts that demand high levels of accuracy and reliability. To mitigate these risks, we inspected samples of the dataset and used image sources that limit the potential for generating such harmful questions. However, we emphasize the importance of ongoing vigilance and responsible use of these models and our dataset to prevent unintended negative consequences.

BibTeX


      TBD