X-InstructBLIP

A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

Artemis Panagopoulou², Le Xue*¹, Ning Yu*¹, Junnan Li¹, Dongxu Li¹, Shafiq Joty¹, Ran Xu¹, Silvio Savarese¹, Caiming Xiong¹, Juan Carlos Niebles¹

*equal mentorship

1. Salesforce AI Research   2. University of Pennsylvania   

Abstract

Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization.

To facilitate instruction-modality fine-tuning, we collect high-quality instruction tuning data in an automatic and scalable manner, composed of 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably to leading-edge counterparts without the need for extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, despite each modality projection being trained individually.

To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.

Single Modality Examples

[Single-modality example inputs: triceratops 3D model, Statue of Liberty 3D model]

Cross-Modal Reasoning Examples

[Cross-modal example inputs: guitar 3D model, pianoforte 3D model, and two Pokémon images]

Overview

[Figure: X-InstructBLIP overview]

X-InstructBLIP is a simple, effective, and scalable cross-modal framework that empowers LLMs to handle a diverse range of tasks across a variety of modalities, without requiring modality-specific pre-training. Although X-InstructBLIP uses a distinct pre-trained encoder for each modality, aligning each one to the language domain through an independently trained, instruction-aware Q-Former, it demonstrates emergent capabilities in cross-modal comprehension.

Method

[Figure: X-InstructBLIP method]

Each modality Q-Former transforms a set of K trainable query tokens, conditioned on both the instruction and the extra-linguistic modality input, into an instruction-aware language representation of that modality. The Q-Former module consists of two transformer submodules that share the same self-attention layers: one submodule interacts with the output of the modality encoder, and the other is a BERT-base text transformer that serves as both an encoder and a decoder. Each Q-Former is initialized with the pre-trained weights from BLIP-2, excluding the cross-attention layers due to a dimension mismatch between the BLIP-2 image encoder and the other modality encoders. The modality embedding interacts with the instruction text and the input query tokens via cross-attention layers inserted every other transformer block. The transformed output query tokens are linearly projected into the frozen LLM's embedding space through a learnable projection layer. Finally, the input to the LLM is the modality prefix, the projected query tokens, and the instruction. The model is optimized under a causal language modeling objective using the ground-truth outputs for each example.
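
To make this dataflow concrete, below is a minimal PyTorch sketch of one modality branch. The class names, the single simplified block, the dimensions, and the toy inputs are illustrative assumptions, not the released implementation, which stacks multiple Q-Former layers initialized from BLIP-2 (BERT-base) weights and feeds a frozen LLM.

# Minimal PyTorch sketch of the per-modality dataflow described above.
# All names, dimensions, and the single simplified block are illustrative
# assumptions; the released model uses multi-layer Q-Formers initialized
# from BLIP-2 / BERT-base weights.
import torch
import torch.nn as nn


class ToyQFormerBlock(nn.Module):
    """One block: shared self-attention over [queries; instruction tokens],
    then cross-attention from the queries into the modality encoder output."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, instr_emb, modality_emb):
        # Queries and instruction tokens attend to each other (shared self-attention).
        x = torch.cat([queries, instr_emb], dim=1)
        x = x + self.self_attn(x, x, x)[0]
        q = x[:, : queries.size(1)]                      # keep only the query positions
        # Query tokens read from the frozen modality encoder's output (cross-attention).
        q = q + self.cross_attn(q, modality_emb, modality_emb)[0]
        return q + self.ffn(q)


class ToyModalityAdapter(nn.Module):
    """K trainable query tokens -> instruction-aware tokens -> LLM embedding space."""

    def __init__(self, num_queries=32, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        self.block = ToyQFormerBlock(qformer_dim)
        self.proj = nn.Linear(qformer_dim, llm_dim)      # learnable projection into the LLM space

    def forward(self, instr_emb, modality_emb):
        q = self.queries.expand(instr_emb.size(0), -1, -1)
        q = self.block(q, instr_emb, modality_emb)
        return self.proj(q)                              # tokens handed to the frozen LLM


if __name__ == "__main__":
    adapter = ToyModalityAdapter()
    instr = torch.randn(2, 16, 768)          # embedded instruction tokens (toy values)
    audio_feats = torch.randn(2, 256, 768)   # output of a frozen audio encoder (toy values)
    llm_tokens = adapter(instr, audio_feats)
    # The frozen LLM then sees: [modality-prefix tokens][llm_tokens][instruction tokens]
    print(llm_tokens.shape)                  # torch.Size([2, 32, 4096])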

New Benchmark: Discriminative Cross-Modal Reasoning

[Figure: DisCRn benchmark examples]

Given two inputs from distinct modalities, the model must select the entity that matches the queried property. Audio is symbolized by waveforms and its semantics are conveyed via annotated captions, 3D is illustrated through a point cloud visualization, and videos are represented by two randomly sampled frames. The task requires models not only to discriminate the inherent characteristics of the involved modalities but also to account for their relative positioning in the input. X-InstructBLIP outperforms a strong captioning baseline that leverages state-of-the-art captioning models to generate a description for each modality, but the task remains an open challenge.

Model                   Image-3D   Audio-Video
Caption Baseline          41.8        30.8
X-InstructBLIP (7b)       48.1        34.0
X-InstructBLIP (13b)      48.8        45.6
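
To make the evaluation protocol concrete, here is a minimal Python sketch of how a DisCRn-style example can be represented, prompted, and scored. The field names, the prompt template, and the A/B answer format are illustrative assumptions rather than the benchmark's released format.

# Illustrative sketch of one DisCRn-style example; field names, the prompt
# template, and the A/B answer convention are assumptions for exposition.
from dataclasses import dataclass


@dataclass
class DiscrnExample:
    modalities: tuple      # e.g. ("audio", "video") or ("image", "3d")
    inputs: tuple          # raw inputs handed to the two modality branches
    question: str          # property to discriminate, e.g. "Which one is a guitar?"
    answer_index: int      # 0 -> first input, 1 -> second input


def build_prompt(example: DiscrnExample) -> str:
    # Each modality's projected query tokens would be spliced in where the
    # placeholders appear; plain-text placeholders are used here for brevity.
    a, b = example.modalities
    return (
        f"Input A ({a}): <{a}-tokens> Input B ({b}): <{b}-tokens>\n"
        f"Question: {example.question} Answer with A or B."
    )


def score(predictions, examples) -> float:
    """Accuracy: a prediction is correct if it names the right input."""
    gold = ["A" if ex.answer_index == 0 else "B" for ex in examples]
    correct = sum(p.strip().upper().startswith(g) for p, g in zip(predictions, gold))
    return correct / len(examples)


if __name__ == "__main__":
    ex = DiscrnExample(("audio", "video"), ("<waveform>", "<clip>"),
                       "Which input features a guitar being strummed?", 0)
    print(build_prompt(ex))
    print(score(["A"], [ex]))   # 1.0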

Citation

@misc{panagopoulou2023xinstructblip,
  title={X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning},
  author={Artemis Panagopoulou and Le Xue and Ning Yu and Junnan Li and Dongxu Li and Shafiq Joty and Ran Xu and Silvio Savarese and Caiming Xiong and Juan Carlos Niebles},
  year={2023},
  eprint={2311.18799},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Ethics Statement

In this research, we present a framework for aligning multiple modalities with a frozen large language model (LLM). Our methodology strictly involves the use of publicly available and free datasets, ensuring we do not engage in the collection of private data. However, it is crucial to acknowledge that publicly sourced datasets carry implicit biases. These biases reflect historical and societal inequalities and can influence the model's outputs.

Our framework builds upon a pre-existing frozen LLM. While this approach benefits from the extensive knowledge encoded within the LLM, such models can propagate biases present in their training data, and there is a non-negligible risk of generating false or misleading information. While there exist tools to measure language model toxicity, such as HELM, their evaluation datasets are constrained to the language modality and hence are not applicable for measuring toxicity across modalities, which is the focus of this work. We leave the construction of cross-modal datasets for toxicity and bias measurement as a future research direction.

Users of our framework should be aware of these limitations and exercise caution, particularly in applications where the accuracy and impartiality of outputs are critical. We advocate for responsible use of our framework, especially in sensitive contexts: users should critically assess and verify the model's outputs and consider the potential for reinforcing biases or spreading misinformation. Furthermore, we commit to transparency regarding our model's capabilities and limitations. All code, data, and model weights will be released to ensure reproducibility and to encourage external evaluation and subsequent research.