Despite significant advances in vision-language models (VLMs), effective approaches for improving response quality by scaling inference-time computation are still lacking. This capability is regarded as a core step toward self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the quality of the sentence generated at the current search step, but also anticipates the quality of subsequent sentences that may follow from it, thus providing a long-term value. In this way, VisVM steers VLMs away from sentences prone to hallucination or insufficient detail, thereby producing higher-quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods using other visual reward signals. Furthermore, we find that self-training the model on VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
Vision Value Model (VisVM) is a value network that provides a reward signal to guide VLM inference-time search, generating descriptive captions in a step-by-step manner.
The core innovation lies in breaking down diverse VLM responses into sentence pairs and using the CLIP score as the reward signal to train VisVM through temporal-difference (TD) learning.
This enables VisVM to predict the impact of the current sentence on future generation, allowing it to avoid response candidates with higher hallucination risk at inference time and to produce image descriptions that are more detailed and less prone to hallucination.
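As a concrete illustration of this training objective, below is a minimal PyTorch sketch of the TD(0) update under stated assumptions: the per-step reward is the CLIP image-sentence similarity, `value_model` is a hypothetical callable standing in for VisVM, and the checkpoint name and discount factor are illustrative placeholders, not the released implementation.

```python
# Minimal sketch of the TD-learning objective for VisVM (hypothetical names).
# Assumes each training example is a pair of consecutive sentences (s_t, s_{t+1})
# from a sampled VLM response, paired with the image; the per-step reward is the
# CLIP image-text similarity of s_t. The real VisVM architecture may differ.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip.eval()

gamma = 0.9  # discount factor (illustrative value)

@torch.no_grad()
def clip_reward(image, sentence):
    """Reward r_t: CLIP cosine similarity between the image and sentence s_t."""
    inputs = processor(text=[sentence], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = clip(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)
    txt = F.normalize(out.text_embeds, dim=-1)
    return (img * txt).sum(-1)  # shape (1,)

def td_loss(value_model, image, sent_t, sent_next, is_last):
    """TD(0) regression: V(s_t) should match r_t + gamma * V(s_{t+1})."""
    r_t = clip_reward(image, sent_t)
    with torch.no_grad():
        # Bootstrap from the next sentence; zero if s_t ends the response.
        v_next = torch.zeros_like(r_t) if is_last else value_model(image, sent_next)
    target = r_t + gamma * v_next
    v_t = value_model(image, sent_t)  # value predicted for the current sentence
    return F.mse_loss(v_t, target)
```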
We use VisVM as the signal to guide VLM inference-time search toward higher-quality responses. At each search step, we sample several sentence candidates and evaluate their values with VisVM. The candidate with the highest value is selected as the response for the current step. This process continues iteratively until the complete response is generated.
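Below is a minimal sketch of this search loop. `generate_sentence_candidates` (sampling several candidate next sentences from the VLM) and the trained `visvm` scorer are hypothetical placeholders, and the stopping criterion is likewise illustrative.

```python
# Minimal sketch of VisVM-guided stepwise search (hypothetical helper names).
def visvm_guided_search(vlm, visvm, image, prompt, num_candidates=4, max_sentences=8):
    response = []
    for _ in range(max_sentences):
        context = prompt + " " + " ".join(response)
        # Sample several candidate next sentences (e.g. temperature sampling).
        candidates = generate_sentence_candidates(vlm, image, context, n=num_candidates)
        if not candidates:
            break
        # Keep the candidate with the highest predicted long-term value.
        best = max(candidates, key=lambda s: visvm(image, s).item())
        response.append(best)
        if best.strip().endswith(("</s>", "<eos>")):  # VLM signalled end of response
            break
    return " ".join(response)
```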
We use LLaVA-Next-7B as the base model, leveraging VisVM as a reward signal to generate high-quality image descriptions as SFT data, and then fine-tune LLaVA-Next-7B on this data. Across nine comprehension and hallucination benchmarks, VisVM-guided self-training boosts LLaVA-Next-7B's performance by an average of 10.8%, demonstrating the potential of this method as a genuine self-training pipeline that continuously enhances visual comprehension.
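For illustration, here is a sketch of how the VisVM-guided captions could be packaged as SFT data in a LLaVA-style conversation format; `load_image`, the record fields, and the reuse of `visvm_guided_search` from the sketch above are assumptions, not the exact pipeline.

```python
# Minimal sketch of assembling SFT data from VisVM-guided captions (hypothetical names).
import json

def build_sft_dataset(vlm, visvm, image_paths, prompt, out_file="visvm_sft.json"):
    records = []
    for path in image_paths:
        image = load_image(path)  # placeholder image loader
        caption = visvm_guided_search(vlm, visvm, image, prompt)
        # LLaVA-style conversation record used for supervised fine-tuning.
        records.append({
            "image": path,
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": caption},
            ],
        })
    with open(out_file, "w") as f:
        json.dump(records, f, indent=2)
```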
If you find our work useful, please consider citing the paper as follows:
@misc{wang2024scalinginferencetimesearchvision,
      title={Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension},
      author={Xiyao Wang and Zhengyuan Yang and Linjie Li and Hongjin Lu and Yuancheng Xu and Chung-Ching Lin and Kevin Lin and Furong Huang and Lijuan Wang},
      year={2024},
      eprint={2412.03704},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}