*equal mentorship
Vision-language pre-training and instruction tuning have demonstrated general-purpose capabilities in 2D visual reasoning tasks by aligning visual encoders with state-of-the-art large language models (LLMs). In this paper, we introduce a simple yet effective cross-modality framework built atop frozen LLMs that allows the integration of various modalities without extensive modality-specific customization.
To facilitate instruction-modality fine-tuning, we collect high-quality instruction-tuning data in an automatic and scalable manner, comprising 24K QA samples for audio and 250K QA samples for 3D. Leveraging instruction-aware representations, our model performs comparably with leading-edge counterparts without the need for extensive modality-specific pre-training or customization. Furthermore, our approach demonstrates cross-modal reasoning abilities across two or more input modalities, despite each modality projection being trained individually.
To study the model's cross-modal abilities, we contribute a novel Discriminative Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA samples and 28K image-3D QA samples that require the model to reason discriminatively across disparate input modalities.
Example instruction-response pairs across individual and combined modalities:
Q: What do you hear in the audio? A: Birds are singing and a stream is flowing in the background.
Q: Where is this instrument from? A: India
Q: A short description. A: 3D model of a triceratops dinosaur.
Q: In which city is this statue located? A: New York
Q: Where is this video taking place? A: Paris
Q: What is happening in the video? A: A group of people flying kites on the beach at sunset.
Q: Does the instrument in the 3D model play in the audio? A: No
Q: Does the instrument in the 3D model play in the audio? A: Yes
Q: Does the creature have a tail? A (image only): No. A (image + 3D): Yes.
X-InstructBLIP is a simple, effective, and scalable cross-modal framework that empowers LLMs to handle a diverse range of tasks across a variety of modalities without requiring modality-specific pre-training. X-InstructBLIP uses a distinct pre-trained encoder for each modality and aligns each into the language domain through an independently trained, instruction-aware Q-Former; despite this independent alignment, it demonstrates emergent capabilities in cross-modal comprehension.
Each modality Q-Former transforms a set of K trainable query tokens, conditioned on both the instruction and the extra-linguistic modality input, into an instruction-aware language representation of the modality. The Q-Former module consists of two transformer submodules that share the same self-attention layers: one submodule interacts with the output of the modality encoder, and the other is a BERT-base text transformer that serves as both an encoder and a decoder. Each Q-Former is initialized with the pre-trained weights from BLIP-2, without the cross-attention layers, due to a dimension mismatch between the image encoder in BLIP-2 and the other modality encoders. The modality embedding interacts with the instruction text and input query tokens via cross-attention layers inserted every other transformer block. The transformed output query tokens are linearly projected into the frozen LLM's embedding space through a learnable projection layer. Finally, the input to the LLM consists of the modality prefix, the transformed output query tokens in the LLM space, and the instruction. The model is optimized under a causal language modeling objective using the ground-truth outputs for each example.
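To make the data flow concrete, below is a minimal sketch, under simplifying assumptions, of one modality branch: trainable query tokens share self-attention with the embedded instruction, cross-attend to the frozen encoder's features, and are linearly projected into the frozen LLM's embedding space. Dimensions, layer count, and module names are illustrative, not the released implementation; the actual model initializes from BLIP-2's BERT-based Q-Former and inserts cross-attention every other block, whereas this sketch applies it in every block.

```python
import torch
import torch.nn as nn

class ModalityQFormerSketch(nn.Module):
    """Simplified stand-in for a modality Q-Former plus its LLM projection."""

    def __init__(self, num_queries=32, qformer_dim=768, encoder_dim=1024,
                 llm_dim=4096, num_layers=2, num_heads=8):
        super().__init__()
        # K trainable query tokens, shared across all inputs of this modality.
        self.query_tokens = nn.Parameter(0.02 * torch.randn(1, num_queries, qformer_dim))
        self.encoder_proj = nn.Linear(encoder_dim, qformer_dim)  # match encoder width
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                # self-attention shared by query tokens and instruction tokens
                "self_attn": nn.MultiheadAttention(qformer_dim, num_heads, batch_first=True),
                # cross-attention from query tokens into the modality features
                "cross_attn": nn.MultiheadAttention(qformer_dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(
                    nn.Linear(qformer_dim, 4 * qformer_dim), nn.GELU(),
                    nn.Linear(4 * qformer_dim, qformer_dim)),
            }) for _ in range(num_layers)
        ])
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)  # learnable projection to LLM space

    def forward(self, modality_feats, instruction_embeds):
        # modality_feats:     (B, N, encoder_dim) from the frozen modality encoder
        # instruction_embeds: (B, T, qformer_dim) embedded instruction text
        feats = self.encoder_proj(modality_feats)
        queries = self.query_tokens.expand(modality_feats.size(0), -1, -1)
        k = queries.size(1)
        for blk in self.blocks:
            # instruction conditioning: queries and instruction attend jointly
            # (instruction embeddings are kept fixed across blocks in this sketch)
            joint = torch.cat([queries, instruction_embeds], dim=1)
            joint = joint + blk["self_attn"](joint, joint, joint)[0]
            queries = joint[:, :k]
            # only the query tokens look at the modality features
            queries = queries + blk["cross_attn"](queries, feats, feats)[0]
            queries = queries + blk["ffn"](queries)
        # (B, K, llm_dim): prepended, with a modality prefix, to the instruction
        return self.llm_proj(queries)
```

In this setup the frozen LLM consumes the sequence [modality prefix; projected query outputs; instruction], and only the Q-Former and projection receive gradients from the causal language-modeling loss on the ground-truth outputs.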
Given two distinct modality inputs, the model must select the entity that matches the queried property. Audio is symbolized by waveforms, with its semantics conveyed via annotated captions; 3D is illustrated through a point-cloud visualization; and videos are represented by two randomly sampled frames. The task requires models not only to discriminate the inherent characteristics of the involved modalities but also to account for their relative positioning in the input. X-InstructBLIP outperforms a strong captioning baseline that leverages state-of-the-art captioning models to generate a description for each modality, yet the task remains an open challenge.
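For illustration, here is a minimal sketch of such a caption-style baseline, under stated assumptions: `captioner_a`, `captioner_b`, and `llm_generate` are hypothetical placeholders for off-the-shelf captioning models and a text-only LLM, and the prompt template is illustrative rather than the exact protocol used in the paper.

```python
def caption_baseline_answer(question, input_a, input_b,
                            captioner_a, captioner_b, llm_generate):
    """Answer a DisCRn-style question by reducing both modalities to text first.

    Because the question asks the model to pick one of the two inputs, the
    prompt must preserve their order, which is why positional cues matter.
    """
    caption_a = captioner_a(input_a)  # e.g. an audio captioner
    caption_b = captioner_b(input_b)  # e.g. a video captioner
    prompt = (
        f"First input: {caption_a}\n"
        f"Second input: {caption_b}\n"
        f"{question} Answer with 'first' or 'second'."
    )
    return llm_generate(prompt)
```

X-InstructBLIP, by contrast, feeds the projected query tokens of both modalities to the frozen LLM directly, without generating intermediate captions.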
| Model | Image-3D | Audio-Video |
|---|---|---|
| Caption Baseline | 41.8 | 30.8 |
| X-InstructBLIP (7b) | 48.1 | 34.0 |
| X-InstructBLIP (13b) | 48.8 | 45.6 |
@misc{panagopoulou2023xinstructblip,
title={X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
representations to LLMs and Emergent Cross-modal Reasoning},
author={Artemis Panagopoulou and Le Xue and Ning Yu and Junnan Li and Dongxu Li and
Shafiq Joty and Ran Xu and Silvio Savarese and Caiming Xiong and Juan Carlos Niebles},
year={2023},
eprint={2311.18799},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
In this research, we present a framework for aligning multiple modalities with a frozen large language model (LLM). Our methodology relies exclusively on publicly available, free datasets, and we do not collect any private data. However, it is crucial to acknowledge that publicly sourced datasets carry implicit biases; these reflect historical and societal inequalities and may influence the model's outputs.

Our framework builds upon a pre-existing frozen LLM. While this approach benefits from the extensive knowledge encoded within the LLM, such models can propagate biases present in their training data, and there is a non-negligible risk of generating false or misleading information. Although tools exist to measure language model toxicity, such as HELM, their evaluation datasets are constrained to the language modality and are therefore not applicable for measuring toxicity across modalities, which is the focus of this work. We leave the construction of cross-modal datasets for toxicity and bias measurement as a future research direction.

Users of our framework should be aware of these limitations and exercise caution, particularly in applications where the accuracy and impartiality of outputs are critical. We advocate for responsible use of our framework, especially in sensitive contexts: users should critically assess and verify the model's outputs and consider the potential for reinforcing biases or spreading misinformation. Furthermore, we commit to transparency regarding our model's capabilities and limitations. All code, data, and model weights will be released to ensure reproducibility and encourage external evaluation and subsequent research.