
Improving Speech Emotion Recognition: A Semi-Supervised Approach for Fine-Grained Analysis

Abstract

A key challenge in speech emotion recognition (SER) is the lack of fine-grained datasets containing both emotion and intensity labels, which limits the performance of data-demanding deep learning models in applications like social companion robots. Most existing datasets cover only basic emotions and rarely include nuanced intensity annotations. To address this gap, we propose using semi-supervised learning (SSL) to create a larger fine-grained SER (FGSER) dataset from the limited datasets available. Our model classifies five distinct emotions (anger, sadness, happiness, disgust, and fear), each represented across three intensity levels: low, medium, and high. We propose two SSL approaches tailored to different application needs: a Random Forest Classifier (RFC) for edge-computing environments that demand computational efficiency, and a Convolutional Neural Network (CNN) for scenarios where higher accuracy is critical. Adding only high-confidence predictions to the original small dataset increases the size of the dataset and thereby improves the classifier's accuracy and generalization. This enhancement supports the development of conversational AI with high emotional intelligence (EQ), advancing FGSER for richer human-computer and human-robot interactions, particularly in social companion robot applications.
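The dataset-expansion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the joint 15-class label layout, the 0.9 confidence threshold, and the function names `select_confident` and `expand_dataset` are all assumptions made for illustration. The sketch takes class probabilities that some classifier (for example, an RFC or a CNN) has already produced for unlabeled clips and keeps only the confident ones.

```python
from itertools import product

EMOTIONS = ["anger", "sadness", "happiness", "disgust", "fear"]
INTENSITIES = ["low", "medium", "high"]

# Assumption: one joint class per (emotion, intensity) pair, 5 x 3 = 15 classes.
CLASSES = [f"{e}/{i}" for e, i in product(EMOTIONS, INTENSITIES)]


def select_confident(probabilities, threshold=0.9):
    """Return (sample_index, class_label) for each prediction whose top
    probability is at or above the (hypothetical) confidence threshold."""
    selected = []
    for idx, row in enumerate(probabilities):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            selected.append((idx, CLASSES[best]))
    return selected


def expand_dataset(labeled, unlabeled, probabilities, threshold=0.9):
    """Return a copy of the labeled set with confident pseudo-labels appended."""
    expanded = list(labeled)
    for idx, label in select_confident(probabilities, threshold):
        expanded.append((unlabeled[idx], label))
    return expanded


if __name__ == "__main__":
    # Two unlabeled clips: one confident prediction, one near-uniform.
    confident = [0.0] * 15
    confident[4] = 0.95
    uncertain = [1 / 15] * 15
    labeled = [("clip_a.wav", "anger/low")]
    print(expand_dataset(labeled, ["clip_b.wav", "clip_c.wav"],
                         [confident, uncertain]))
```

In a full pipeline this selection would typically be repeated: retrain the classifier on the expanded set, predict on the remaining unlabeled clips, and select again. A stricter threshold admits fewer but cleaner pseudo-labels.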