Skip to main navigation menu Skip to main content Skip to site footer

Action Recognition using Deep Convolutional Neural Networks and Compressed Spatio-Temporal Pose Encodings


Convolutional neural networks have recently shown proficiency at
recognizing actions in RGB video. Existing models are gener-
ally very deep, requiring large amounts of data to train effectively.
Moreover, they rely mainly on global appearance and could poten-
tially underperform in single-environment applications, such as a
sports event. To overcome these limitations, we propose to short-
cut spatial learning by leveraging the activations within a human
pose estimation network. The proposed framework integrates a
human pose estimation network with a convolutional classifier via
compressed encodings of pose activations. When evaluated on
UTD-MHAD, a 27-class multimodal dataset, the pose-based RGB
action recognition model achieves a classification accuracy of 98.4%
in a subject-specific experiment and outperforms a baseline method
that fuses depth and inertial sensor data.