Explainable Chain-of-Thought Object Counting in Vision-Language Models using Reinforcement Learning

Abstract

Counting objects in images remains challenging for vision-language models, especially when instances are small, dense, or overlapping. We introduce an explainable counting framework for Qwen-2.5-VL that uses reinforcement learning with Group Relative Policy Optimization (GRPO) and Low-Rank Adaptation (LoRA) to produce not only a numeric count but also a transparent chain of thought that points to each object via centroid coordinates. Using an augmented TallyQA subset with centroid annotations, we design a multi-objective reward system that jointly optimizes format adherence, count accuracy, and spatial precision. Our GRPO-trained model achieves 67.94% counting accuracy and 92.62% pointing accuracy, clearly outperforming both the baseline (34.84% / 2.73%) and supervised fine-tuning (59.93% / 86.89%). Ablation studies show that single-reward training often leads to reward exploitation, while combining complementary rewards produces balanced and interpretable outputs. These results demonstrate the potential of reinforcement learning for making visual reasoning both accurate and verifiable.
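To make the abstract's multi-objective reward concrete, the sketch below shows one plausible way to combine format, count, and pointing rewards. It is illustrative only: the `<think>`/`<answer>` tag layout, the coordinate syntax, the 25-pixel matching tolerance, the greedy centroid matching, and the weights `w` are all assumptions, since the abstract does not spell out the exact formulation.

```python
import re

# Illustrative sketch of a multi-objective counting reward (not the paper's exact code).
# Tag format, tolerance, matching rule, and weights below are assumptions.

POINT_RE = re.compile(r"\((\d+\.?\d*),\s*(\d+\.?\d*)\)")

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>N</answer> layout."""
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>\d+</answer>", response, re.S) else 0.0

def count_reward(response: str, gt_count: int) -> float:
    """1.0 when the predicted count matches the ground-truth count exactly."""
    m = re.search(r"<answer>(\d+)</answer>", response)
    return 1.0 if m and int(m.group(1)) == gt_count else 0.0

def pointing_reward(response: str, gt_centroids, tol: float = 25.0) -> float:
    """Fraction of ground-truth centroids greedily matched by a predicted point within tol pixels."""
    preds = [(float(x), float(y)) for x, y in POINT_RE.findall(response)]
    if not gt_centroids:
        return 1.0 if not preds else 0.0
    used, hits = set(), 0
    for gx, gy in gt_centroids:
        best, best_d = None, tol
        for i, (px, py) in enumerate(preds):
            if i in used:
                continue
            d = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(gt_centroids)

def total_reward(response: str, gt_count: int, gt_centroids, w=(0.2, 0.5, 0.3)) -> float:
    """Weighted sum of the three component rewards; the weights are placeholders."""
    return (w[0] * format_reward(response)
            + w[1] * count_reward(response, gt_count)
            + w[2] * pointing_reward(response, gt_centroids))
```

In a GRPO setup, a scalar reward of this kind would be computed per sampled completion and advantages taken relative to the group mean; the single weighted scalar shown here is one simple way to combine the three objectives the abstract names.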