
The Effects of Label Errors in Training Data on Model Performance and Overfitting

Abstract

Training data used in machine learning applications are often assumed to be perfect, i.e., free of errors; however, this is almost never the case, and such errors may limit the performance of the resulting model. In this paper, the effects of label errors in training data are studied quantitatively and in relation to model overfitting. When label errors are created artificially, a constrained (small) CNN model is observed to exhibit remarkable generalizability, retaining high accuracy even when most of the training data are mislabelled. Test accuracy falls catastrophically only at unrealistically high label error rates, at a point related to the number of classes present in the data. These preliminary experiments pave the way for further studies of model robustness, possibly offering a quantitative method for comparing models.
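As an illustration of what artificially creating label errors can look like, the following is a minimal sketch, not the procedure used in the paper. It assumes symmetric noise: each label is replaced, with probability `error_rate`, by a different class chosen uniformly at random. The function names and the noise model are assumptions made for this example. Under that assumed model, the true class remains the most frequent label for any input as long as the error rate stays below (K - 1)/K for K classes, which is one plausible reading of a breakdown point that depends on the number of classes.

```python
import random


def inject_label_errors(labels, num_classes, error_rate, seed=0):
    """Return a copy of `labels` with symmetric label noise applied.

    Assumption (not taken from the paper): each label is flipped with
    probability `error_rate` to one of the other classes, chosen uniformly.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < error_rate:
            # Draw from the num_classes - 1 wrong classes, skipping y itself.
            wrong = rng.randrange(num_classes - 1)
            if wrong >= y:
                wrong += 1
            noisy.append(wrong)
        else:
            noisy.append(y)
    return noisy


def observed_error_rate(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


def plurality_threshold(num_classes):
    """Error rate below which the true label is still the single most likely
    label under the symmetric-noise assumption above: (K - 1) / K."""
    return (num_classes - 1) / num_classes


if __name__ == "__main__":
    clean = [i % 10 for i in range(10_000)]  # hypothetical 10-class labels
    noisy = inject_label_errors(clean, num_classes=10, error_rate=0.6)
    print("observed error rate:", observed_error_rate(clean, noisy))
    print("assumed breakdown point:", plurality_threshold(10))
```

With ten classes, the assumed threshold is 0.9, so a dataset in which 60% of labels are wrong still carries a majority-consistent signal under this noise model. Whether a given network can exploit that signal is the empirical question the experiments address.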