Despite significant advances in vision-language models (VLMs), effective approaches for improving response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step, but also anticipates the quality of subsequent sentences that may follow from it, thus providing a long-term value. In this way, VisVM steers VLMs away from sentences prone to hallucination or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods using other visual reward signals. Furthermore, we find that self-training the model on VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
Vision Value Model (VisVM) is a value network that provides a reward signal to guide VLM inference-time search, generating descriptive captions in a step-by-step manner.
The core innovation lies in breaking down diverse VLM responses into sentence pairs and using the CLIP score as the reward signal to train VisVM through temporal-difference (TD) learning.
This enables VisVM to predict the impact of the current sentence on future generation, allowing it to avoid response candidates with higher hallucination risk at inference time and to produce image descriptions that are more detailed and less prone to hallucination.
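As a concrete illustration of this training objective, below is a minimal PyTorch sketch of the TD(0) update under stated assumptions: the per-step reward is the CLIP image-sentence similarity, `value_model` is a hypothetical callable standing in for VisVM, and the checkpoint name and discount factor are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of the TD-learning objective for VisVM (hypothetical names).
# Assumes each training example is a pair of consecutive sentences (s_t, s_{t+1})
# from a sampled VLM response, paired with the image; the per-step reward is the
# CLIP image-text similarity of s_t. The real VisVM architecture may differ.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip.eval()

gamma = 0.9  # discount factor (illustrative value)

@torch.no_grad()
def clip_reward(image, sentence):
    """Reward r_t: CLIP cosine similarity between the image and sentence s_t."""
    inputs = processor(text=[sentence], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = clip(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)
    txt = F.normalize(out.text_embeds, dim=-1)
    return (img * txt).sum(-1)  # shape (1,)

def td_loss(value_model, image, sent_t, sent_next, is_last):
    """TD(0) regression: V(s_t) should match r_t + gamma * V(s_{t+1})."""
    r_t = clip_reward(image, sent_t)
    with torch.no_grad():
        # Bootstrap from the next sentence; zero if s_t ends the response.
        v_next = torch.zeros_like(r_t) if is_last else value_model(image, sent_next)
    target = r_t + gamma * v_next
    v_t = value_model(image, sent_t)  # value predicted for the current sentence
    return F.mse_loss(v_t, target)
```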
We use VisVM as the signal to guide VLM inference-time search toward higher-quality responses. At each search step, we sample several sentence candidates and evaluate their values with VisVM. The candidate with the highest value is selected as the response for the current step. This process continues iteratively until the complete response is generated.
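Below is a minimal sketch of this search loop. `generate_sentence_candidates` (sampling several candidate next sentences from the VLM) and the trained `visvm` scorer are hypothetical placeholders, and the stopping criterion is likewise illustrative.

```python
# Minimal sketch of VisVM-guided stepwise search (hypothetical helper names).
def visvm_guided_search(vlm, visvm, image, prompt, num_candidates=4, max_sentences=8):
    response = []
    for _ in range(max_sentences):
        context = prompt + " " + " ".join(response)
        # Sample several candidate next sentences (e.g. temperature sampling).
        candidates = generate_sentence_candidates(vlm, image, context, n=num_candidates)
        if not candidates:
            break
        # Keep the candidate with the highest predicted long-term value.
        best = max(candidates, key=lambda s: visvm(image, s).item())
        response.append(best)
        if best.strip().endswith(("</s>", "<eos>")):  # VLM signalled end of response
            break
    return " ".join(response)
```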
We use LLaVA-Next-7B as the base model, leveraging VisVM as a reward signal to generate high-quality image descriptions as SFT data, and then fine-tune LLaVA-Next-7B on this data. Across nine comprehension and hallucination benchmarks, VisVM-guided self-training boosts LLaVA-Next-7B's performance by an average of 10.8%, demonstrating the potential of this method as a genuine self-training pipeline that continuously enhances visual comprehension.
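For illustration, here is a sketch of how the VisVM-guided captions could be packaged as SFT data in a LLaVA-style conversation format; `load_image`, the record fields, and the reuse of `visvm_guided_search` from the sketch above are assumptions, not the exact pipeline.

```python
# Minimal sketch of assembling SFT data from VisVM-guided captions (hypothetical names).
import json

def build_sft_dataset(vlm, visvm, image_paths, prompt, out_file="visvm_sft.json"):
    records = []
    for path in image_paths:
        image = load_image(path)  # placeholder image loader
        caption = visvm_guided_search(vlm, visvm, image, prompt)
        # LLaVA-style conversation record used for supervised fine-tuning.
        records.append({
            "image": path,
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": caption},
            ],
        })
    with open(out_file, "w") as f:
        json.dump(records, f, indent=2)
```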
If you find our work useful, please consider citing the paper as follows:
@misc{wang2024scalinginferencetimesearchvision,
      title={Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension},
      author={Xiyao Wang and Zhengyuan Yang and Linjie Li and Hongjin Lu and Yuancheng Xu and Chung-Ching Lin and Kevin Lin and Furong Huang and Lijuan Wang},
      year={2024},
      eprint={2412.03704},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}