Human action recognition and facial-expression analysis in cinematic footage remain challenging because most systems ignore camera-dependent changes in visible detail: close-ups, medium shots, and long shots carry fundamentally different expressive cues, yet models typically treat scale as irrelevant. We address this gap with ZoomGate, a unified, scale-aware pipeline for human behaviour understanding across mixed zoom levels. Using a movie-trailer dataset with frame-level zoom annotations, we train image backbones to classify camera scale and route video segments to view-specific recognition modules tailored to facial emotions, micro-gestures, upper-body actions, full-body motion, or hand-only gestures. For close-up segments, a multimodal Gemini-based analysis produces structured, temporally aligned descriptions of emotion dynamics and articulatory behaviour. Experiments show that scale-conditioned processing yields more coherent and interpretable predictions than scale-agnostic baselines. ZoomGate thus provides a principled foundation for computer-vision systems and AI characters that adjust behaviour naturally with camera distance.
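To make the routing idea concrete, the sketch below shows one way a scale-conditioned pipeline of this shape could be wired up: a per-frame zoom classifier feeding a dispatch table of per-scale recognition modules. This is a minimal illustration under stated assumptions, not the paper's actual implementation; `classify_scale`, the analyzer functions, and the majority-vote segment labeling are all hypothetical stand-ins.

```python
# Minimal sketch of ZoomGate-style scale-conditioned routing. All names are
# illustrative assumptions, not the paper's API.
from collections import Counter
from typing import Callable, Dict, List

def classify_scale(frames: List) -> str:
    """Stand-in for the zoom-classifying image backbone: predict a scale
    label per frame, then majority-vote over the segment (an assumption;
    the paper specifies only frame-level zoom annotations)."""
    per_frame = ["close_up" for _ in frames]  # dummy per-frame predictions
    return Counter(per_frame).most_common(1)[0][0]

def analyze_close_up(frames: List) -> dict:
    # Close-ups: facial emotion dynamics and micro-gestures (the paper
    # pairs these segments with a multimodal Gemini-based analysis).
    return {"scale": "close_up", "emotions": []}

def analyze_medium(frames: List) -> dict:
    # Medium shots: upper-body actions and hand-only gestures.
    return {"scale": "medium", "actions": []}

def analyze_long(frames: List) -> dict:
    # Long shots: full-body motion.
    return {"scale": "long", "motion": []}

# One recognition module per camera scale, tuned to the cues visible there.
ROUTERS: Dict[str, Callable[[List], dict]] = {
    "close_up": analyze_close_up,
    "medium": analyze_medium,
    "long": analyze_long,
}

def zoomgate(frames: List) -> dict:
    """Route a video segment to the module matching its predicted scale."""
    return ROUTERS[classify_scale(frames)](frames)

# Usage: zoomgate(decoded_frames) -> a scale-conditioned prediction dict.
```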