Visual Unit Tests for More Robust Visual Programming

Abstract

Programming-based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs 33% of the time. These models are often right for the wrong reasons and risk unexpected failures on new data. Unit tests play a foundational role in ensuring code correctness and could be used to repair such failures.

We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. In our framework, a unit test is represented as a novel image and answer pair meant to verify the logical correctness of a program produced for a given query. Our method leverages a language model to create unit tests in the form of image descriptions and expected answers, and uses image synthesis to produce the corresponding images.
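As a rough illustration of the idea above, a visual unit test can be modeled as an image paired with an expected answer, and a program can be scored by the fraction of tests it passes. This is a minimal sketch, not the paper's implementation: the names `VisualUnitTest` and `unit_test_score` are hypothetical, a plain dictionary stands in for a synthesized image, and exact string matching is an assumed comparison rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class VisualUnitTest:
    """Hypothetical container: one synthesized image and its expected answer."""
    image: Any            # stand-in for a generated image
    expected_answer: str  # answer a logically correct program should return


def unit_test_score(program: Callable[[Any], str],
                    tests: List[VisualUnitTest]) -> float:
    """Return the fraction of unit tests whose expected answer the program reproduces."""
    if not tests:
        return 0.0
    passed = 0
    for test in tests:
        try:
            answer = program(test.image)
        except Exception:
            continue  # a runtime error counts as a failed test
        if str(answer).strip().lower() == test.expected_answer.strip().lower():
            passed += 1
    return passed / len(tests)


# Toy query: "Is there a dog in the image?" (dictionaries stand in for images)
tests = [VisualUnitTest({"dogs": 2}, "yes"), VisualUnitTest({"dogs": 0}, "no")]
correct_program = lambda img: "yes" if img["dogs"] > 0 else "no"
shortcut_program = lambda img: "yes"  # can be right on one image for the wrong reason

print(unit_test_score(correct_program, tests))   # 1.0
print(unit_test_score(shortcut_program, tests))  # 0.5
```

The second program shows why a test suite needs images with different expected answers: a program that always says "yes" is only exposed by a test whose answer is "no".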

We conduct a comprehensive analysis of what constitutes an effective visual unit test suite, exploring unit test generation, sampling strategies, image generation methods, and the number of programs and unit tests. Additionally, we introduce four applications of visual unit tests: best program selection, answer refusal, re-prompting, and unsupervised reward formulations for reinforcement learning. Experiments with two models across three datasets in visual question answering and image-text matching demonstrate that ViUniT improves model performance by 11.4%. Notably, it enables 7B open-source models to outperform gpt-4o-mini by an average of 7.7% and reduces the occurrence of programs that are correct for the wrong reasons by 40%.

Framework Overview

Use-Cases of Visual Unit Tests

ViUniT can be used in a variety of applications to improve the robustness of visual programming.

Here are some examples that we explore in the paper and show to outperform existing methods:

Best Program Selection

ViUniT can be used to select the best program from a set of candidates, allowing even 7B models to outperform gpt-4o-mini by an average of 7.7%.

[Figure legend: GPT-4, CodeLlama-7B (Base Setup), CodeGemma-7B (Base Setup), CodeLlama-7B (ViUniT), CodeGemma-7B (ViUniT)]
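Best program selection can be sketched as scoring every candidate program on the same unit tests and keeping the highest scorer. The helper names `pass_rate` and `select_best_program` are hypothetical, unit tests are assumed to be `(image, expected answer)` pairs, and ties are broken in favor of the earliest candidate, which is an arbitrary choice made for this example.

```python
from typing import Any, Callable, Sequence, Tuple

Program = Callable[[Any], str]
UnitTest = Tuple[Any, str]  # (image, expected answer)


def pass_rate(program: Program, tests: Sequence[UnitTest]) -> float:
    """Fraction of unit tests the program answers as expected."""
    if not tests:
        return 0.0
    passed = 0
    for image, expected in tests:
        try:
            if program(image) == expected:
                passed += 1
        except Exception:
            pass  # a crashing program fails that test
    return passed / len(tests)


def select_best_program(candidates: Sequence[Program],
                        tests: Sequence[UnitTest]) -> Tuple[int, float]:
    """Return (index, score) of the candidate with the highest pass rate."""
    if not candidates:
        raise ValueError("need at least one candidate program")
    scores = [pass_rate(p, tests) for p in candidates]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index, scores[best_index]


# Toy query: "How many cups are there?" (dictionaries stand in for images)
tests = [({"cups": 3}, "3"), ({"cups": 0}, "0")]
candidates = [
    lambda img: "3",               # hard-coded guess
    lambda img: str(img["cups"]),  # actually counts
]
best_index, best_score = select_best_program(candidates, tests)
print(best_index, best_score)  # 1 1.0
```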

Answer Refusal

Answer refusal under uncertainty is a long-standing challenge in AI. ViUniT can refuse to answer when a program's unit test score falls below a threshold. It correctly suppresses answers with an F1 score of 0.8 (F1 Refusal), while as few as 2% of programs that pass the tests give a wrong answer (Pass Failure Rate).

[Figure: F1 Refusal Score and Pass Failure Rate]
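The refusal rule can be sketched as a threshold on the unit test score. In the example below, the function names, the threshold value of 0.7, and the exact F1 definition are assumptions made for illustration: a refusal counts as a true positive when the program's answer would have been wrong, and the paper's definition may differ.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Program = Callable[[Any], str]
UnitTest = Tuple[Any, str]  # (image, expected answer)


def answer_or_refuse(program: Program, image: Any, tests: Sequence[UnitTest],
                     threshold: float = 0.7) -> Optional[str]:
    """Answer only if the program's unit test score reaches the threshold."""
    passed = 0
    for test_image, expected in tests:
        try:
            passed += int(program(test_image) == expected)
        except Exception:
            pass  # a crashing program fails that test
    score = passed / len(tests) if tests else 0.0
    return program(image) if score >= threshold else None  # None means "refuse"


def refusal_f1(refused: Sequence[bool], wrong: Sequence[bool]) -> float:
    """F1 of refusals, treating 'the answer would have been wrong' as the positive class."""
    tp = sum(r and w for r, w in zip(refused, wrong))
    fp = sum(r and not w for r, w in zip(refused, wrong))
    fn = sum((not r) and w for r, w in zip(refused, wrong))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy query: "Is the car red?" (dictionaries stand in for images)
tests = [({"red": True}, "yes"), ({"red": False}, "no")]
careful = lambda img: "yes" if img["red"] else "no"
shortcut = lambda img: "yes"

print(answer_or_refuse(careful, {"red": True}, tests))   # yes
print(answer_or_refuse(shortcut, {"red": True}, tests))  # None
```

Raising the threshold trades coverage for reliability: fewer programs are allowed to answer, but those that do have passed more of their tests.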

Re-prompting

Instead of refusing to answer, ViUniT can be used to re-prompt the model with a new question if the program fails to pass the unit tests.
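One way to picture re-prompting is a loop that regenerates the program until it passes enough unit tests or a round limit is reached. Everything in this sketch is an assumption made for illustration: the `generate(query, feedback)` callable stands in for a code-generating language model, the feedback string format is invented, and the threshold and round limit are arbitrary.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Program = Callable[[Any], str]
UnitTest = Tuple[Any, str]  # (image, expected answer)
Generator = Callable[[str, Optional[str]], Program]


def run_tests(program: Program,
              tests: Sequence[UnitTest]) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the pass rate and a list of (expected, got) pairs for failed tests."""
    failures: List[Tuple[str, str]] = []
    for image, expected in tests:
        try:
            got = program(image)
        except Exception as exc:
            got = f"error: {exc}"
        if got != expected:
            failures.append((expected, got))
    score = 1.0 - len(failures) / len(tests) if tests else 0.0
    return score, failures


def reprompt_until_pass(query: str, generate: Generator, tests: Sequence[UnitTest],
                        threshold: float = 0.7,
                        max_rounds: int = 3) -> Tuple[Optional[Program], float]:
    """Regenerate the program with failure feedback until it scores above the threshold."""
    feedback: Optional[str] = None
    best_program, best_score = None, -1.0
    for _ in range(max_rounds):
        program = generate(query, feedback)
        score, failures = run_tests(program, tests)
        if score > best_score:
            best_program, best_score = program, score
        if score >= threshold:
            break
        # Summarize the failed tests for the next prompt (hypothetical format).
        feedback = "; ".join(f"expected {e!r}, got {g!r}" for e, g in failures)
    return best_program, best_score
```

A stub generator that returns a shortcut program first and a corrected one once it receives feedback is enough to exercise the loop; in practice `generate` would call the program-synthesis model.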

Unsupervised Reinforcement Learning Reward

The unsupervised ViUniT reward surpasses the supervised correctness reward!

[Figure legend: GPT-4, CodeLlama-7B (Base Setup), CodeGemma-7B (Base Setup), CodeLlama-7B (Correctness), CodeGemma-7B (Correctness), CodeLlama-7B (ViUniT), CodeGemma-7B (ViUniT)]

BibTeX

@article{panagopoulou2024viunit,
  author    = {Panagopoulou, Artemis and Zhou, Honglu and Savarese, Silvio and Xiong, Caiming and Callison-Burch, Chris and Yatskar, Mark and Niebles, Juan Carlos},
  title     = {ViUniT: Visual Unit Tests for More Robust Visual Programming},
  journal   = {arXiv},
  year      = {2024},
}

Visual programs can be correct for the wrong reasons!

ViUniT automatically generates unit tests for visual programs.