Dietary monitoring is a complex yet highly impactful challenge within food computing, given its potential to transform the personalized management of metabolic and general health. Traditional 2D image-based assessment methods capture only static visual cues, offering limited information on eating behaviour. In contrast, video-based analysis provides richer temporal information, enabling the study of what is eaten, how it is eaten, and in what quantity. Prior research in this area introduced a baseline framework that used Vision-Language Models (VLMs) to analyze eating videos on a frame-by-frame basis. While this approach established the feasibility of using VLMs to interpret eating behaviour, it also revealed key limitations in model accuracy and contextual understanding. Building on this foundation, the present study evaluates the Gemini 2.5 Flash model against the previous framework across diverse eating scenarios and examines how its performance changes with more specific prompts. These findings offer high-level insights into the potential of modern multimodal VLMs for dietary monitoring and pave the way for more accurate and practical approaches to video-based nutrient assessment.