Jack Spurrier

Synthetic Data: A Warning

Synthetic data sounds, on the surface, redundant: artificial duplicates of real data. These datasets share the same mathematical and statistical properties as a real-world dataset, but the individual datapoints are completely different, so the set contains no sensitive human information. The principal purpose of synthetic data is privacy: developers and engineers can use this artificial information as a stand-in for real data without ever accessing any individual's records.
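A toy sketch makes the definition concrete. Here a crude synthesiser (a fitted Gaussian, far simpler than the tools real vendors use) produces records that reproduce the aggregate statistics of some made-up "real" data while sharing no actual rows with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend "real" data: 1,000 records of (age, income), positively correlated.
real = rng.multivariate_normal(mean=[40, 30_000],
                               cov=[[100, 12_000], [12_000, 25_000_000]],
                               size=1_000)

# A crude synthesiser: estimate the mean and covariance of the real data,
# then sample brand-new records from the fitted distribution.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1_000)

# The aggregate statistics match closely...
print(np.corrcoef(real, rowvar=False)[0, 1])       # correlation in real data
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # correlation in synthetic data

# ...but no synthetic record is an actual person's record.
shared = {tuple(row) for row in real} & {tuple(row) for row in synthetic}
print(len(shared))  # 0: no overlap between real and synthetic rows
```

The correlation between age and income survives the synthesis; the people do not.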


Approximately 69% of countries worldwide have adopted data protection laws, including the GDPR in the EU and the Data Protection Act 2018 in the UK (post-Brexit). The rise of data privacy legislation has made it increasingly difficult for businesses, governments, and individuals to process data: they must not only comply with stringent country-specific red tape, but also contend with cross-border inconsistencies in legislation when processing data worldwide. Data privacy has become something of a legal minefield. The legislation is necessary to prevent predatory privacy invasion by tech companies, but it reduces big data's usefulness for building machine learning tools and digital twins (the metaverse). Synthetic data, however, is a growing solution to these constraints.


Synthetic data uses machine learning to enable machine learning. It is generated by a deep learning tool called a Generative Adversarial Network (GAN), the same type of machine learning model behind much NFT art. The GAN generates a completely random dataset and then iteratively adjusts datapoints so that its abstract properties become closer to those of the real data. Eventually, the synthetic dataset cannot be differentiated from the real-world data. The result is an anonymised dataset containing the same complex correlations between its variables as real-world data. In 2016, a study by the Synthetic Data Vault project found that 70% of the time there was no difference in performance between synthetic data and real-world data.
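The iterative loop described above can be caricatured in a few lines of numpy. To be clear, this is not a GAN (a real GAN pits a generator network against a discriminator network); it is a toy sketch that matches only two summary statistics, just to show the shape of the feedback loop: start from noise, repeatedly nudge it towards the real data's properties.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Real" data: a skewed income distribution.
real = rng.lognormal(mean=10, sigma=0.5, size=5_000)

# Step 1: start from pure random noise.
synthetic = rng.uniform(0, 1, size=5_000)

# Step 2: repeatedly nudge the noise so its aggregate properties
# (here just mean and spread) drift towards those of the real data.
# A GAN drives this loop with a discriminator network rather than
# explicit statistics, but the feedback loop has the same shape.
for _ in range(200):
    # rescale the spread around the current mean
    centre = synthetic.mean()
    scale = 1 + 0.05 * (real.std() / synthetic.std() - 1)
    synthetic = centre + (synthetic - centre) * scale
    # shift the mean
    synthetic += 0.05 * (real.mean() - synthetic.mean())

print(f"real: mean {real.mean():.0f}, std {real.std():.0f}")
print(f"synthetic: mean {synthetic.mean():.0f}, std {synthetic.std():.0f}")
```

After a couple of hundred small adjustments the noise has the real data's mean and spread, despite starting as uniform values between 0 and 1. A GAN does the same thing with far richer properties: full distributions and cross-variable correlations rather than two numbers.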


There are three main commercial benefits to synthetic data. First, privacy: synthetic data protects sensitive data from being shared, and because it contains no personal information it can sidestep data protection laws, vastly reducing compliance costs (approximately 88% of businesses spend more than $1m on GDPR compliance). Second, speed: synthetic data can be accessed and processed more quickly to train and deploy AI solutions. Third, scale: anonymisation allows datasets to be shared more widely and at reduced market prices. While this is an exciting proposition on the surface, there are ethical implications rotting at the core of AI development that need to be recognised.


Synthetic data importantly strips a dataset of individual human datapoints, but the large-scale abstract patterns still encode human bias. A long-standing problem with AI is its veneer of objectivity: a pervasive assumption that AI is independent of human understanding, human idiosyncrasies, and human biases because it learns in its own complex way. In reality, human bias riddles AI's input data and, therefore, its outputs. Synthetic data is no different. What is worse, this supposedly objective, anonymised data is now training AI whose output may be put at a premium because of its supposedly 'neutral' inputs. Synthetic data is the latest mirage of neutrality in artificial intelligence, and its application in training machine learning tools multiplies the effect.


The anonymisation of data also obscures the explainability of AI. Explainability is an important concept within AI: an algorithm must be able to explain why it allocates a kidney transplant to one person but not another, because humans have a duty to explain how life-changing decisions are reached. The AI ethicist Timnit Gebru points to 'datasheets for datasets', documentation explaining how the data behind an AI was collected and how the AI was trained. However, a synthetic dataset resists such documentation, since its datapoints are derived from random noise rather than traceable sources. Human biases based on race, gender, wealth, or age can pollute these datasets with no easy way of identifying them.


Synthetic data is a brilliant tool for anonymising data and democratising datasets, but it masquerades as wholly socially beneficial. It is another example of human bias disguised behind artificially 'intelligent' algorithms. Researchers, data scientists, and data analysts must recognise, evaluate, and recalibrate datasets to be 'fair' before generating synthetic data from them.
