
Evaluation Methods for Synthetic Data in Pursuit of Open Data

Abstract

Access to real data that contains sensitive or personal information often requires lengthy approval processes and is subject to stringent restrictions. Synthetic data that is generated from the real data, resembles it, and follows FAIR standards is a promising route to open data for administrative data. Although progress has been made in establishing accepted evaluations for synthetic data models, key holistic metrics that would help policymakers decide on open data initiatives are still missing. In this paper, we introduce and demonstrate three metrics: an identity disclosure risk (IDR) assessment as a measure of privacy risk, the Hellinger distance (HD) as a quantitative measure of univariate distributional similarity, and the differential pairwise correlation (DPC) as a quantitative bivariate measure. Including these privacy, univariate, and bivariate metrics in standard synthetic data evaluation can help policymakers better understand and use synthetic data models and methods in pursuit of open data.
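To make the univariate and bivariate metrics concrete, the following is a minimal Python sketch rather than the paper's implementation. The Hellinger distance uses its standard definition over the empirical distributions of a categorical variable. The abstract does not define DPC, so the sketch assumes it is the absolute difference between the real and synthetic Pearson correlation for each pair of numeric columns. The function names and the column-dictionary layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def hellinger_distance(real, synthetic):
    """Hellinger distance between the empirical distributions of two
    categorical samples: 0 means identical, 1 means no shared categories."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    total = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / len(real)
        q = synth_counts[category] / len(synthetic)
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlation_difference(real_cols, synth_cols):
    """ASSUMED reading of DPC (not taken from the paper): for every pair of
    columns, the absolute difference between real and synthetic correlation.
    Both arguments map column name -> list of numeric values."""
    differences = {}
    for a, b in combinations(sorted(real_cols), 2):
        real_r = pearson(real_cols[a], real_cols[b])
        synth_r = pearson(synth_cols[a], synth_cols[b])
        differences[(a, b)] = abs(real_r - synth_r)
    return differences
```

Under these assumptions, values near 0 for both functions would suggest that the synthetic data preserves the real data's marginal distributions and pairwise relationships. Continuous variables would need to be binned before the Hellinger distance is computed, and the choice of binning is left open here.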