Explainable Chain-of-Thought Object Counting in Vision-Language Models using Reinforcement Learning

Abstract

Counting objects in images remains challenging for vision-language models, especially when instances are small, dense, or overlapping. We introduce an explainable counting framework for Qwen-2.5-VL that uses reinforcement learning with Group Relative Policy Optimization (GRPO) and Low-Rank Adaptation (LoRA) to produce not only a numeric count but also a transparent chain of thought that points to each object via centroid coordinates. Using an augmented TallyQA subset with centroid annotations, we design a multi-objective reward system that jointly optimizes format adherence, count accuracy, and spatial precision. Our GRPO-trained model achieves 67.94% counting accuracy and 92.62% pointing accuracy, clearly outperforming both the baseline (34.84% / 2.73%) and supervised fine-tuning (59.93% / 86.89%). Ablation studies show that single-reward training often leads to reward exploitation, while combining complementary rewards produces balanced and interpretable outputs. These results demonstrate the potential of reinforcement learning for making visual reasoning both accurate and verifiable.
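To make the abstract's multi-objective reward concrete, the sketch below shows one plausible way to combine format, count, and pointing rewards. It is illustrative only: the `<think>`/`<answer>` tag layout, the coordinate syntax, the 25-pixel matching tolerance, the greedy centroid matching, and the weights `w` are all assumptions, since the abstract does not spell out the exact formulation.

```python
import re

# Illustrative sketch of a multi-objective counting reward (not the paper's exact code).
# Tag format, tolerance, matching rule, and weights below are assumptions.

POINT_RE = re.compile(r"\((\d+\.?\d*),\s*(\d+\.?\d*)\)")

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>N</answer> layout."""
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>\d+</answer>", response, re.S) else 0.0

def count_reward(response: str, gt_count: int) -> float:
    """1.0 when the predicted count matches the ground-truth count exactly."""
    m = re.search(r"<answer>(\d+)</answer>", response)
    return 1.0 if m and int(m.group(1)) == gt_count else 0.0

def pointing_reward(response: str, gt_centroids, tol: float = 25.0) -> float:
    """Fraction of ground-truth centroids greedily matched by a predicted point within tol pixels."""
    preds = [(float(x), float(y)) for x, y in POINT_RE.findall(response)]
    if not gt_centroids:
        return 1.0 if not preds else 0.0
    used, hits = set(), 0
    for gx, gy in gt_centroids:
        best, best_d = None, tol
        for i, (px, py) in enumerate(preds):
            if i in used:
                continue
            d = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(gt_centroids)

def total_reward(response: str, gt_count: int, gt_centroids, w=(0.2, 0.5, 0.3)) -> float:
    """Weighted sum of the three component rewards; the weights are placeholders."""
    return (w[0] * format_reward(response)
            + w[1] * count_reward(response, gt_count)
            + w[2] * pointing_reward(response, gt_centroids))
```

In a GRPO setup, a scalar reward of this kind would be computed per sampled completion and advantages taken relative to the group mean; the single weighted scalar shown here is one simple way to combine the three objectives the abstract names.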