Contra4

Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D

1University of Pennsylvania,
2Salesforce Research
*Work done during internship at Salesforce.

Introduction

To achieve a deeper understanding of the world, AI must be able to reason across multiple modalities, such as images, audio, video, and 3D. While recent efforts have extended multimodal models to process multiple modalities, there is little evidence that they enable reasoning beyond two modalities simultaneously. This limitation arises partly from the challenge of constructing tasks that require reasoning across multiple modalities.

To address this, we introduce Contra4, a dataset designed to train and evaluate contrastive cross-modal reasoning over up to four modalities (audio, video, image, and 3D) simultaneously. Our approach unifies modalities through human-annotated captions and generates contrastive question-answer pairs, filtered via a mixture-of-models round-trip-consistency check.

Human inspection validates the high quality of Contra4, with 83.3% perceived correctness, while fine-tuning on the task results in a 56% relative accuracy improvement. Benchmarking state-of-the-art models on a human-annotated subset of 2.3k samples underscores the dataset's challenge: the best-performing model achieves only 56% accuracy overall and just 42% in four-modality settings.

Leaderboard

Accuracy scores on the human-annotated test set of Contra4.

| # | Model | Method | LLM Base | ALL | Random All | Random MC2 | Random MC3 | Random MC4 | Similarity All | Similarity MC2 | Similarity MC3 | Similarity MC4 |
|---|-------|--------|----------|-----|------------|------------|------------|------------|----------------|----------------|----------------|----------------|
| 1 | CREMA 🥇 | MLLM 🖼️ | FlanT5-xl | 0.56 | 0.60 | 0.71 | 0.61 | 0.45 | 0.53 | 0.64 | 0.55 | 0.39 |
| 3 | OneLLM-Finetuned | MLLM 🖼️ | LLaMA-2 7B-Finetuned | 0.50 | 0.54 | 0.60 | 0.43 | 0.58 | 0.47 | 0.60 | 0.36 | 0.43 |
| 2 | X-InstructBLIP | MLLM 🖼️ | Vicuna1.1 7B | 0.32 | 0.31 | 0.47 | 0.30 | 0.13 | 0.33 | 0.48 | 0.27 | 0.22 |
| 3 | OneLLM | MLLM 🖼️ | LLaMA-2 7B | 0.32 | 0.31 | 0.52 | 0.16 | 0.24 | 0.34 | 0.52 | 0.22 | 0.27 |
| 4 | Gemini-2.0* | MLLM 🖼️ | gemini-2.0-flash-exp | 0.22 | 0.23 | 0.24 | 0.10 | × | 0.20 | 0.21 | 0.14 | × |
| 5 | Predicted Caption | LLM 💬 | LLaMA-3.1 7B | 0.37 | 0.38 | 0.52 | 0.33 | 0.26 | 0.36 | 0.46 | 0.33 | 0.27 |

Method types: MLLM 🖼️: Cross-Modal model, LLM 💬: Large Language Model with Predicted Captions
MCX: Multiple choice with X options. Similarity: Negative sampling based on high similarity of captions (a brief sketch follows these notes). Random: Random negative sampling.
CREMA uses an additional RGB signal for 3D inputs.
* Gemini is evaluated only on examples that do not include 3D, since it does not yet support 3D input.
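
To make the "Similarity" setting concrete, below is a minimal sketch of caption-similarity-based distractor selection versus random sampling. The embedding model name is an assumption for illustration, not necessarily what the benchmark uses.

```python
import random
from sentence_transformers import SentenceTransformer, util

def similarity_negatives(target_caption: str, pool: list[str], k: int) -> list[int]:
    """Pick the k pool captions most similar to the target caption (hard negatives)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    target_emb = model.encode(target_caption, convert_to_tensor=True)
    pool_emb = model.encode(pool, convert_to_tensor=True)
    scores = util.cos_sim(target_emb, pool_emb)[0]
    return scores.topk(k).indices.tolist()

def random_negatives(pool: list[str], k: int) -> list[int]:
    """The 'Random' setting simply samples distractors uniformly."""
    return random.sample(range(len(pool)), k)
```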

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Contra4 Dataset

Overview

The Contra4 dataset introduced in this paper comprises 174k distinct training samples and 2.3k human-annotated test samples, designed to evaluate the cross-modal reasoning capabilities of large multimodal models. The inputs span up to four modalities (audio, video, image, and 3D) sourced from 5 different captioning datasets, with questions generated and answers verified through a mixture-of-models round-trip-consistency method. A human inspection of the data showed that our data generation pipeline achieves 83.3% accuracy in generating valid examples for fine-tuning.
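
For illustration only, the snippet below sketches what a single multiple-choice record could look like once the modalities are unified through their captions; every field name is a hypothetical placeholder, not the released schema.

```python
# Illustrative sketch only: hypothetical field names, not the released Contra4 schema.
example = {
    "options": [                                    # up to four candidate instances
        {"modality": "audio", "source_id": "..."},  # drawn from the source captioning datasets
        {"modality": "video", "source_id": "..."},
        {"modality": "image", "source_id": "..."},
        {"modality": "3d",    "source_id": "..."},
    ],
    "question": "Which of the inputs shows ...?",   # contrastive question over the options
    "answer_index": 2,                              # index of the correct instance
    "negative_sampling": "similarity",              # "random" or "similarity" distractors
}
```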

Data Generation & Finetuning

Mixture-of-Models Round-Trip-Consistency data generation pipeline for Contra4.
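
As a rough sketch of the idea behind this pipeline (not the released implementation): one model proposes a contrastive question over the human-annotated captions, and a set of independent answerer models must recover the intended answer from the captions alone before the sample is kept. All names below (`Candidate`, `answerers`, `predict_index`) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    captions: list[str]   # one human-annotated caption per candidate modality instance
    question: str         # generated contrastive question over those instances
    answer_idx: int       # index of the instance the question was written about

def round_trip_consistent(cand: Candidate, answerers, min_agree: int = 2) -> bool:
    """Keep a generated QA pair only if enough independent models, shown just the
    captions and the question, pick the intended option (round-trip consistency)."""
    prompt = (
        "Options:\n"
        + "\n".join(f"({i}) {c}" for i, c in enumerate(cand.captions))
        + f"\nQuestion: {cand.question}\nReply with the index of the best option."
    )
    votes = [model.predict_index(prompt) for model in answerers]  # hypothetical model API
    return sum(v == cand.answer_idx for v in votes) >= min_agree
```

Filtering with a mixture of answerer models, rather than a single one, reduces the chance that one model's idiosyncrasies leak into the kept samples.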

Visualization

Ethics Statement

In conducting this research, we acknowledge the significant limitations and potential dangers associated with the use of Large Language Models (LLMs). One of the primary concerns is the presence of inherent biases within LLMs, which are a direct consequence of the data on which they are trained. These biases can inadvertently perpetuate harmful stereotypes and lead to discriminatory outcomes, particularly in sensitive applications. Additionally, LLMs, especially those with large parameter counts, may generate outputs that are factually incorrect or misleading, posing a risk in contexts that demand high levels of accuracy and reliability. To mitigate these risks, we inspected samples of the dataset and used image sources that limit the potential for generating such harmful questions. However, we emphasize the importance of ongoing vigilance and responsible use of these models and our dataset to prevent unintended negative consequences.

BibTeX


      TBD