A key challenge in speech emotion recognition
(SER) is the lack of fine-grained datasets contain-
ing both emotion and intensity labels, which lim-
its the performance of data-demanding deep learn-
ing models in applications like social companion
robots. Most existing datasets cover only basic
emotions and rarely include nuanced intensity
annotations. To address this gap, we use
semi-supervised learning (SSL) to build a larger
fine-grained SER (FGSER) dataset from the limited
datasets available. Our model classifies five
distinct emotions (anger, sadness, happiness,
disgust, and fear), each represented at three
intensity levels: low, medium, and high. We propose
two SSL approaches tailored to different applica-
tion needs: a Random Forest Classifier (RFC) for
edge-computing environments that demand compu-
tational efficiency, and a Convolutional Neural Net-
work (CNN) for scenarios where higher accuracy is
critical. Adding only high-confidence predictions
to the original small dataset will increase the size
of the dataset and thereby improve the classifier’s
accuracy and generalization. This enhancement
supports the development of conversational
AI with high emotional intelligence (EQ), advanc-
ing FGSER for richer human-computer and human-
robot interactions, more specifically for social com-
panion robotic applications.
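
The dataset-growing step described above can be illustrated with a minimal sketch of one pseudo-labeling round. This is not the authors' implementation: the class naming scheme, the `grow_dataset` helper, the `predict_proba` callable, and the 0.9 confidence threshold are all assumptions made for illustration. Any classifier that returns one probability per class (for example an RFC or a CNN) could stand in for `predict_proba`.

```python
from itertools import product

EMOTIONS = ["anger", "sadness", "happiness", "disgust", "fear"]
INTENSITIES = ["low", "medium", "high"]
# 5 emotions x 3 intensity levels = 15 fine-grained classes.
# The "emotion_intensity" naming is a hypothetical convention.
CLASSES = [f"{e}_{i}" for e, i in product(EMOTIONS, INTENSITIES)]


def grow_dataset(labeled, unlabeled, predict_proba, threshold=0.9):
    """Run one pseudo-labeling round.

    labeled       -- list of (sample, class_name) pairs
    unlabeled     -- list of samples without labels
    predict_proba -- callable returning one probability per entry in CLASSES
    threshold     -- assumed confidence cut-off (not taken from the paper)

    Returns the enlarged labeled list and the samples that stayed unlabeled.
    """
    new_labeled = list(labeled)
    remaining = []
    for sample in unlabeled:
        probs = predict_proba(sample)
        # Index of the most probable class for this sample.
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            # Confident prediction: keep it as a pseudo-label.
            new_labeled.append((sample, CLASSES[best]))
        else:
            # Low confidence: leave the sample for a later round.
            remaining.append(sample)
    return new_labeled, remaining
```

In a full self-training loop, the classifier would be retrained on the enlarged labeled set and the round repeated on the remaining samples until no further predictions pass the threshold.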