Investigating the Impact of Inclusion in Face Recognition Training Data
Abstract
Modern face recognition systems leverage datasets containing
images of hundreds of thousands of individuals’ faces. Recently, there
has been significant public scrutiny of the privacy implications of
large-scale training datasets such as MS-Celeb-1M, as many people
are uncomfortable with their faces being used to train dual-use
technologies that can enable mass surveillance. However, the impact
of an individual’s inclusion in training data on a derived system’s
ability to recognize them has not previously been studied. In
this work, we audit ArcFace, a state-of-the-art, open-source face
recognition system, in a large-scale face identification experiment.
We find Rank-1 identification accuracy of 79.71% for individuals
present in training data and 75.73% for those not present. These
results demonstrate that modern face recognition systems work better
for individuals they are trained on, which has serious privacy
implications, as no large-scale, open-source training dataset
gathers informed consent from individuals during its collection.
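
To make the reported metric concrete, the following is a minimal sketch of how Rank-1 identification accuracy could be computed separately for individuals inside and outside the training set. It is not the evaluation code used in this work: the function names, the `(identity, embedding)` data layout, and the use of cosine similarity as the matching score are all assumptions made for illustration, and the embeddings are assumed to have been produced beforehand by a face recognition model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank1_accuracy(probes, gallery):
    """Fraction of probes whose most similar gallery entry has the same identity.

    probes, gallery: lists of (identity, embedding) pairs (hypothetical layout).
    """
    if not probes:
        raise ValueError("rank1_accuracy needs at least one probe")
    correct = 0
    for probe_id, probe_emb in probes:
        # Rank-1 match: the single gallery entry with the highest similarity.
        best_id, _ = max(
            gallery, key=lambda item: cosine_similarity(probe_emb, item[1])
        )
        if best_id == probe_id:
            correct += 1
    return correct / len(probes)


def rank1_by_group(probes, gallery, in_training_ids):
    """Rank-1 accuracy for probes whose identity is in the training set
    versus probes whose identity is not. Both groups search the same gallery."""
    seen = [p for p in probes if p[0] in in_training_ids]
    unseen = [p for p in probes if p[0] not in in_training_ids]
    return rank1_accuracy(seen, gallery), rank1_accuracy(unseen, gallery)
```

Searching both probe groups against one shared gallery keeps the difficulty of the search comparable, so a gap between the two returned accuracies reflects the group split rather than a difference in gallery size.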