Accuracy scores on the human-annotated test set of Contra4.
| # | Model | Method | LLM Base | ALL | Random All | Random MC2 | Random MC3 | Random MC4 | Similarity All | Similarity MC2 | Similarity MC3 | Similarity MC4 |
|---|-------|--------|----------|-----|------------|------------|------------|------------|----------------|----------------|----------------|----------------|
| 1 | CREMA 🥇† | MLLM 🖼️ | FlanT5-xl | 0.56 | 0.60 | 0.71 | 0.61 | 0.45 | 0.53 | 0.64 | 0.55 | 0.39 |
| 3 | OneLLM-Finetuned | MLLM 🖼️ | LLaMA-2 7B-Finetuned | 0.50 | 0.54 | 0.60 | 0.43 | 0.58 | 0.47 | 0.60 | 0.36 | 0.43 |
| 2 | X-InstructBLIP | MLLM 🖼️ | Vicuna1.1 7B | 0.32 | 0.31 | 0.47 | 0.30 | 0.13 | 0.33 | 0.48 | 0.27 | 0.22 |
| 3 | OneLLM | MLLM 🖼️ | LLaMA-2 7B | 0.32 | 0.31 | 0.52 | 0.16 | 0.24 | 0.34 | 0.52 | 0.22 | 0.27 |
| 4 | Gemini-2.0* | MLLM 🖼️ | gemini-2.0-flash-exp | 0.22 | 0.23 | 0.24 | 0.10 | × | 0.20 | 0.21 | 0.14 | × |
| 5 | Predicted Caption | LLM 💬 | LLaMA-3.1 7B | 0.37 | 0.38 | 0.52 | 0.33 | 0.26 | 0.36 | 0.46 | 0.33 | 0.27 |
Method types — MLLM 🖼️: cross-modal model; LLM 💬: large language model operating on predicted captions.
MCX: multiple choice with X options. Random: negative (distractor) options sampled at random. Similarity: negative options sampled from examples with highly similar captions; a sketch of both strategies follows the footnotes.
† CREMA uses an additional RGB signal for 3D inputs.
* Gemini is evaluated only on examples that do not include 3D, since it does not yet support 3D input; × marks settings that were therefore not evaluated.
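
For readers implementing their own evaluation, here is a minimal sketch of the two negative-sampling strategies described above. The function names, the use of caption embeddings with cosine similarity, and the data layout are illustrative assumptions, not the benchmark's released code:

```python
import random
import numpy as np

def random_negatives(pool_ids, positive_id, k, rng=random.Random(0)):
    """Random strategy: draw k distractor examples uniformly from the pool."""
    candidates = [i for i in pool_ids if i != positive_id]
    return rng.sample(candidates, k)

def similarity_negatives(caption_embeddings, positive_id, k):
    """Similarity strategy: pick the k examples whose caption embeddings are
    closest (cosine similarity) to the positive's caption, i.e. hard negatives.
    `caption_embeddings` is assumed to be an (n, d) float array."""
    emb = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
    sims = emb @ emb[positive_id]
    sims[positive_id] = -np.inf  # exclude the positive itself
    return np.argsort(-sims)[:k].tolist()

# Building an MCX question: 1 correct option plus (X - 1) negatives,
# e.g. MC3 = the correct example + 2 sampled distractors.
```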
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
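
If it helps, below is a minimal sketch of writing such a file. The schema shown (test example id mapped to the model's chosen option index) is purely hypothetical; the expected format is defined by the organizers, not here:

```python
import json

# Hypothetical result format: one entry per test example, mapping its id
# to the chosen option index. The actual expected schema may differ.
predictions = {"example_0001": 2, "example_0002": 0}

with open("results.json", "w") as f:
    json.dump(predictions, f, indent=2)
```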