publications | Artemis Panagopoulou

2024

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

Le Xue, Ning Yu, Shu Zhang, and 8 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Evaluating Vision-Language Models on Bistable Images

Artemis Panagopoulou, Coby Melkin, and Chris Callison-Burch

In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Aug 2024

Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously, by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 121 different manipulations in brightness, resolution, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the models’ preferences with humans, noting that the models do not exhibit the same continuity biases as humans and often diverge from human initial interpretations. We also investigated the influence of variations in prompts and the use of synonymous labels, discovering that these factors affect model interpretations significantly more than image manipulations, showing a higher influence of the language priors on bistable image interpretations compared to image-text training data. All code and data are open-sourced.

X-InstructBLIP: A framework for aligning X-modal instruction-aware representations to LLMs and emergent cross-modal reasoning

Artemis Panagopoulou, Le Xue, Ning Yu, and 7 more authors

Aug 2024

ViUniT: Visual Unit Tests for More Robust Visual Programming

Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, and 4 more authors

arXiv preprint arXiv:2412.08859, Dec 2024

2023

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

Yue Yang, Artemis Panagopoulou, Shenghao Zhou, and 3 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2023

I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, and 4 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL·E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. To evaluate the utility of our Human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.

2022

QuakerBot: A Household Dialog System Powered by Large Language Models

Artemis Panagopoulou, Manni Arora, Li Zhang, and 11 more authors

Alexa Prize TaskBot Challenge Proceedings, Jul 2022

@article{panagopoulouquakerbot,
  title = {QuakerBot: A Household Dialog System Powered by Large Language Models},
  year = {2022},
  author = {Panagopoulou, Artemis and Arora, Manni and Zhang, Li and Cugini, Dimitri and You, Weiqiu and Yang, Yue and Zhou, Liyang and Wang, Yuxuan and Hou, Zhaoyi and Hwang, Alyssa and Martin, Lara and Shi, Sherry and Callison-Burch, Chris and Yatskar, Mark},
  journal = {Alexa Prize TaskBot Challenge Proceedings}
}

Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction

Yue Yang, Artemis Panagopoulou, Marianna Apidianaki, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2022, Dec 2022

Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives’ concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models.

2021

Visual Goal-Step Inference using wikiHow

Yue Yang, Artemis Panagopoulou, Qing Lyu, and 3 more authors

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Dec 2021

Induce, edit, retrieve: Language grounded multimodal schema for instructional video retrieval

Yue Yang, Joongwon Kim, Artemis Panagopoulou, and 2 more authors

arXiv preprint arXiv:2111.09276, Nov 2021

@article{yang2021induce,
  title = {Induce, edit, retrieve: Language grounded multimodal schema for instructional video retrieval},
  author = {Yang, Yue and Kim, Joongwon and Panagopoulou, Artemis and Yatskar, Mark and Callison-Burch, Chris},
  journal = {arXiv preprint arXiv:2111.09276},
  year = {2021}
}

Self-supervised optical flow with spiking neural networks and event based cameras

Kenneth Chaney, Artemis Panagopoulou, Chankyu Lee, and 2 more authors

In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Dec 2021

@inproceedings{chaney2021self,
  title = {Self-supervised optical flow with spiking neural networks and event based cameras},
  author = {Chaney, Kenneth and Panagopoulou, Artemis and Lee, Chankyu and Roy, Kaushik and Daniilidis, Kostas},
  booktitle = {2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  pages = {5892--5899},
  year = {2021},
  organization = {IEEE}
}