FoodVideoQA: A Novel Framework for Dietary Monitoring

Abstract

Food intake monitoring is a crucial area of research in food computing due to its complexity and significant potential for improving health outcomes. While traditional 2D image-based dietary assessments provide basic information, video offers a more detailed understanding of both the quantity of food consumed and the manner in which it is eaten. However, current video-based dietary analysis remains limited to coarse metrics, such as counting bites. In this paper, we introduce FoodVideoQA, a novel approach that leverages Vision-Language Models (VLMs) to analyze food intake videos comprehensively. Our framework produces lists of ingredients, utensils, and consumed foods, along with the specific time intervals in a video during which a person is eating. This work paves the way for more advanced multimodal food intake measurement and behavioral studies.