
Investigating the Impact of Inclusion in Face Recognition Training Data


Modern face recognition systems leverage datasets containing
images of hundreds of thousands of individuals’ faces. Recently, there
has been significant public scrutiny of the privacy implications of
large-scale training datasets such as MS-Celeb-1M, as many people
are uncomfortable with their faces being used to train dual-use
technologies that can enable mass surveillance. However, the impact
of an individual’s inclusion in training data on a derived system’s
ability to recognize them has not previously been studied. In
this work, we audit ArcFace, a state-of-the-art, open-source face
recognition system, in a large-scale face identification experiment.
We find Rank-1 identification accuracy of 79.71% for individuals
present in training data and 75.73% for those not present. These
results demonstrate that modern face recognition systems work better
for individuals they are trained on, which has serious privacy
implications, as no large-scale, open-source training dataset
gathers informed consent from individuals during its collection.
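
For readers unfamiliar with the metric, the following is a minimal sketch of how Rank-1 identification accuracy can be computed. A probe counts as correct when the single most similar gallery embedding belongs to the same identity. The function names, the use of cosine similarity, and the two-dimensional toy embeddings are illustrative assumptions; they are not the audit's actual pipeline or real ArcFace outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank1_accuracy(probes, gallery):
    """Fraction of probes whose top-ranked gallery match has the same identity.

    probes, gallery: lists of (identity_label, embedding) pairs.
    Assumes every embedding is non-zero and the gallery is non-empty.
    """
    if not probes:
        raise ValueError("need at least one probe")
    hits = 0
    for probe_id, probe_vec in probes:
        # Rank-1 match: the gallery entry with the highest similarity.
        best_id, _ = max(
            gallery, key=lambda entry: cosine_similarity(probe_vec, entry[1])
        )
        hits += best_id == probe_id
    return hits / len(probes)


# Hypothetical toy data: three enrolled identities, three probe images.
gallery = [("alice", [1.0, 0.0]), ("bob", [0.0, 1.0]), ("carol", [-1.0, 0.0])]
probes = [("alice", [0.9, 0.1]), ("bob", [0.2, 0.8]), ("carol", [0.1, 0.9])]

# The "carol" probe lies closest to bob's gallery embedding, so it is a miss.
toy_accuracy = rank1_accuracy(probes, gallery)
print(f"Rank-1 accuracy: {toy_accuracy:.2%}")
```

Under this definition, the comparison reported above amounts to computing the same quantity separately for probes of individuals present in the training data and for probes of individuals who are not.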