Synthetic Data (or: how to prevent the heat death of Machine Learning)

Published in Elaia · Feb 22, 2024

Still from Ridley Scott’s Blade Runner (1982)

By Arturo Ancira García. Edited by Anya Brochier, Louisa Mesnard and Marc Rougier.

What does the future of computing look like?

This is a recurring question we ask ourselves at Elaia as we try to envision the key technologies, needs, and uses of everything compute-related in the near future. Something that caught our attention recently, and that reshapes our view of the question, was a study showing that high-quality language data could be exhausted by 2026 and vision data by 2030, a shortage that would slow down machine learning progress.

Although some may argue that this should be taken more as a thought experiment than as a news item, the trend can be seen happening in real time, right outside our digital doorstep. Common Crawl, a non-profit organization that crawls the internet and makes its data sets publicly available, collected almost 500 TB of uncompressed content (more than 3 billion webpages) last December alone. Considering that there are an estimated 200 million active websites (each composed of multiple pages) on the internet, it is safe to say that AI model-builders have already exhausted most of the publicly accessible web.

An impending shortage of real-world data presents a risk not only to AI companies, but to the general public, too. As real-world, representative samples of data become more valuable, they become more guarded and less democratized. This trend can be seen today among AI industry leaders, including companies that were traditionally open-source. Technical reports accompanying new releases now withhold much of the information that would be useful to the research community, driven by economic pressure to protect talent pools and to keep the most cutting-edge research in-house. In other words, as real-world data becomes scarcer and more valuable, competition in the foundation model space intensifies drastically.

To achieve differentiation, the first line of defense is, as is often the case, the team's mathematical ability and how receptive it is to scientific progress, so as to remain state-of-the-art. In a second phase, when open source catches up with novel innovations in architecture and performance, the last big differentiator between models is the data they were trained on.

So, although it has recently become inseparable from the rest of the VC buzzwords being thrown around, synthetic data is, objectively, an essential building block in both the AI and data-product stacks.

Common Crawl’s Cumulative Crawl Size (until December 2023)

Let’s rewind: what is synthetic data?

Synthetic data can be thought of as data that does not come from real-world measurements or data collection. NVIDIA defines it as “annotated information that computer simulations or algorithms generate as an alternative to real-world data”. To put it differently, even though it is generated, synthetic data reflects real-world data through a mathematical transformation that retains little, or none, of the original information; the transformation's utility is directly proportional to how closely the synthetic data set reproduces the original's descriptive statistics and outliers. As a simplified illustration, imagine taking an image and rotating it 180 degrees: the resulting image's pixel composition is unchanged, yet, aesthetically, it becomes a different image to the untrained eye.
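To make the rotation example concrete, here is a minimal sketch in NumPy (using a randomly generated stand-in for a real image, so the array contents are purely illustrative): the rotated copy preserves every pixel value, and therefore the original's descriptive statistics, while its layout no longer matches the original.

```python
import numpy as np

# A randomly generated 64x64 grayscale "image" standing in for a real one.
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# Rotating by 180 degrees: every original pixel value is preserved,
# only its position changes.
rotated = np.rot90(image, k=2)

# The two images share the same descriptive statistics...
assert image.mean() == rotated.mean()
assert np.array_equal(np.sort(image, axis=None), np.sort(rotated, axis=None))

# ...yet, pixel for pixel, they no longer match.
print(np.array_equal(image, rotated))  # False (almost surely, for a random image)
```

Real synthetic data pipelines are of course far richer than a single flip, but the principle is the same: transform the data while preserving the statistics that matter.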

Characters from Martin Handford’s Where’s Waldo?

Like its real-world counterpart, synthetic data can take many forms, including text, tabular data, audio, or images. The use of synthetic data has become essential in industries that require precise, specific, and vast amounts of real-world data to train AI models. In essence, more data (usually) means better ML models, and better ML models mean a more complete picture of whatever real-world scenario is being simulated: what the outliers are, what the anomalies are, and what can be repeatedly observed.

So what’s another giant paywall?

Synthetic data is a tool that plays an important role in accelerating the training of ML models, extending data sets, and protecting sensitive information. More idealistically, synthetic data also makes it possible for independent researchers and smaller-scale model builders, who may lack the resources and bargaining power of industry giants, to actively participate in the research landscape. Although the theoretical limitations of synthetic data are not completely clear, this democratization of data access not only fosters innovation but also levels the playing field, empowering a broader spectrum of contributors to make meaningful ML advancements in their respective fields.

The relevance of synthetic data extends across the most pertinent spaces inside the AI industry today, and intersects with many of the sectors we have been investing in at Elaia since 2002. Given this ongoing trend, actors in this domain cannot afford to ignore developments in synthetic data; the opportunity cost of failing to keep up is too high.

Deciphering the different types of synthetic data

The synthetic data space splits into two categories: structured and unstructured data. For practical purposes, the main difference between the two is that structured data is formatted, organized, and searchable (think: tabular data such as a time series), while unstructured data has no predefined format or structure, making it difficult to categorize or deconstruct (think: video, audio, social media posts, etc.).

Synthetic versions of structured data are mainly used to enlarge ML training sets, reduce biases, and increase the completeness of data sets. Unstructured synthetic data usually takes the form of 2D, 3D, or 4D images, generated using diffusion models, GANs, or VAEs, and is used to train computer vision models for applications like medical imagery or self-driving cars.
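As a purely illustrative sketch of the mechanism (not of any production pipeline), the snippet below shows in PyTorch how a VAE-style decoder turns latent noise into images. The decoder here is randomly initialised rather than trained, so its outputs are meaningless; in practice the decoder would be trained on real images, and the same sampling loop would then produce usable synthetic ones.

```python
import torch
import torch.nn as nn

# Toy VAE-style decoder: maps a low-dimensional latent vector to a 28x28 image.
# It is untrained here; a real generator would be fitted to real images first.
latent_dim = 16
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 28 * 28),
    nn.Sigmoid(),  # pixel intensities in [0, 1]
)

# Generating synthetic images amounts to sampling latent codes from the prior
# (a standard normal) and pushing them through the decoder.
with torch.no_grad():
    z = torch.randn(8, latent_dim)                  # 8 latent samples
    synthetic_images = decoder(z).view(8, 28, 28)   # 8 synthetic 28x28 "images"

print(synthetic_images.shape)  # torch.Size([8, 28, 28])
```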

Unstructured synthetic data: a data play?

The inherent value of real-world data is a function of its scarcity, the labor involved in collecting it, how much work is required to clean and preprocess it, and the number of its use cases. Since competition scales with how accessible the underlying real-world data sets are, many actors in the unstructured synthetic data space choose to specialize in niche verticals as a way to outrun competitors.

In medical imagery, for example, the most sought-after data sets are related to niche pathologies, niche demographics, or a combination of the two. This type of information is not only limited but often sensitive, too, which incentivises health and research institutions to guard these real-world images quite closely. This makes it necessary for players in the sector to establish symbiotic partnerships should they hope to sell synthetic versions to other actors in the space. In this sense, the difficulty of obtaining quality (and cheap) unstructured synthetic data is inseparable from that of obtaining the real version, which makes us appreciate the strategic positioning of data vendors and of those with access to privileged sets of information. In the long run, winners in this domain will likely emerge thanks to the quality and quantity of their partnerships, and less so because of their technical superiority.

In a conversation with Alexis Ducarouge, co-founder and CTO of Gleamer, an Elaia portfolio company that develops an AI co-pilot for radiologists and that has been closely exploring relevant synthetic data solutions for several years, one conclusion emerged: although synthetic data applications in this sector have the potential to disrupt the development of internal tools and products, the quality of generated images is not yet medical-grade for practical uses. Consequently, in certain cases, getting your hands on the real data from the start remains a more viable option than waiting for synthetic alternatives to mature.

Pairs of real and synthetic retina vessel maps (RVMs) from the research paper “Synthetic Medical Images for Robust, Privacy-Preserving Training of Artificial Intelligence: Application to Retinopathy of Prematurity Diagnosis” (2022)

Structured synthetic data: an engineering play?

With most data sets used in ML models containing tabular files, problems such as class imbalance or impurities (missing values, for instance) are omnipresent in training data. One way to overcome this is to fill in the gaps with synthetic data. The current challenges mostly lie in the preprocessing phase, where existing, widely used methods either over-process the data, often losing important information, or introduce spurious data points (a byproduct of model hallucinations).
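As a minimal, purely illustrative sketch of both ideas (the data set below is made up, and the oversampling step is a simplified, SMOTE-like interpolation rather than any specific vendor's method):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# A small, hypothetical training set with missing values and a rare positive class.
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 29, 45, np.nan, 38, 62],
    "income": [52_000, 61_000, 48_000, np.nan, 75_000, 44_000, np.nan, 83_000],
    "label":  [0, 0, 0, 0, 0, 0, 1, 1],   # imbalanced target
})

# 1) Fill gaps by sampling from each column's observed (empirical) distribution,
#    which preserves the column's marginal statistics better than a constant fill.
for col in ["age", "income"]:
    observed = df[col].dropna().to_numpy()
    n_missing = df[col].isna().sum()
    df.loc[df[col].isna(), col] = rng.choice(observed, size=n_missing)

# 2) Rebalance classes by interpolating between pairs of minority-class rows
#    (a simplified, SMOTE-like oversampling step).
minority = df[df["label"] == 1][["age", "income"]].to_numpy()
new_rows = []
for _ in range(4):
    a, b = minority[rng.choice(len(minority), size=2, replace=True)]
    lam = rng.uniform()
    new_rows.append(lam * a + (1 - lam) * b)

synthetic = pd.DataFrame(new_rows, columns=["age", "income"])
synthetic["label"] = 1
augmented = pd.concat([df, synthetic], ignore_index=True)
print(augmented["label"].value_counts())
```

Even naive synthetic rows can restore balance and completeness; the hard part, as noted above, is doing so without distorting the information the real data carried.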

Today, novel approaches to structured synthetic data generation use LLMs and knowledge distillation to overcome the preprocessing bottleneck. These techniques add textual context to tabular data in order to generate a synthetic representation based on semantic understanding. In a nutshell, the process usually involves a textual encoder that converts each row of the tabular data set into a textual representation (think: 5 -> “five”), fine-tunes a pre-trained LLM on this textually encoded tabular knowledge, and then uses the fine-tuned LLM to contextualize the text and generate synthetic data.
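Here is a minimal sketch of the row-to-text step only (the fine-tuning and sampling of the LLM itself are omitted, and the column names and values are made up):

```python
import pandas as pd

# A toy tabular data set standing in for the real training data.
df = pd.DataFrame({
    "age": [34, 51],
    "occupation": ["nurse", "engineer"],
    "income": [52_000, 61_000],
})

def row_to_text(row: pd.Series) -> str:
    """Serialise one row as a sentence the LLM can be fine-tuned on."""
    return ", ".join(f"{col} is {row[col]}" for col in row.index) + "."

def text_to_row(text: str) -> dict:
    """Parse a generated sentence back into a tabular record."""
    fields = text.rstrip(".").split(", ")
    return {k: v for k, v in (f.split(" is ", 1) for f in fields)}

corpus = [row_to_text(row) for _, row in df.iterrows()]
print(corpus[0])
# "age is 34, occupation is nurse, income is 52000."

# After fine-tuning an LLM on `corpus`, newly sampled sentences would be
# decoded back into rows with `text_to_row` to rebuild a synthetic table.
print(text_to_row(corpus[0]))
# {'age': '34', 'occupation': 'nurse', 'income': '52000'}
```

Serialising rows as natural language lets the LLM leverage what it already knows about the meaning of the column names, which is precisely the semantic understanding mentioned above.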

This space may not be new, but with constant advances in AI and the democratization of open-source LLM tooling, many actors with solid ML know-how have begun proposing faster, cheaper, and higher-quality alternatives for generating synthetic data. The result? A highly competitive sector, where winners must constantly innovate at the cutting edge to stay in the race and keep thinking about how to creatively address underserved use cases; essentially, finding a good problem to solve before anybody else does.

In a discussion, Hugo Laurençon, Research Engineer at Hugging Face and PhD in Machine Learning, who recently made the rounds on AI Twitter for his work with synthetically generated HTML/CSS code, highlighted that although generating structured synthetic data remains an iterative process, you cannot improve the output quality indefinitely; the limiting factors remain the amount of available compute and the quality of the base model used to generate the synthetic data.

A message in a bottle:

Many questions remain unanswered, questions that we at Elaia have naturally been bouncing back and forth with actors in this space. It is likely that model-builders will consistently require synthetic data in one form or another but, from an entrepreneurial standpoint, does addressing this pain point represent a one-shot deal or an ongoing endeavor? Can solutions to specific problems be scaled to general needs? Or will these solutions remain pertinent only in isolated instances, preventing synthetic data generation from becoming an easily deployable, packaged product and confining it to the role of an internal tool? Although there are no definite answers, the domain of synthetic data continues to evolve and, from our perspective, is a fruitful space for researchers and entrepreneurs alike, and thus imperative for us at Elaia to keep an eye on.

Diagram from Ferdinand de Saussure’s Cours de Linguistique Générale (1916)

Real data is dead, long live real data!

The ratio of real to generated content floating around the internet today is incontestably trending towards generated. Luckily, we can still (mostly) tell the difference between the two from the style of writing, its structure, or its context, though the odds are that this will soon become impossible, leaving the Turing test as a relic of the past. Similarly, synthetic data has the potential to become a 1:1 representation of real-world measurements. And, semiotically, real-world measurements are themselves abstract, 1:1 representations of real-world observations: in the same way that the symbol “4” commonly represents four individual units of something, or that Elaia means “swallow” in Basque.

Apart from the monetizable applications of mathematics, there is also a sense of beauty in the pursuit of the expression that most closely mirrors a real-world phenomenon in its simplest form. Or of a new answer that is less wrong than the previous one.

If this rings a bell and you’re working closely with synthetic data or building in spaces adjacent to it, don’t hesitate to contact us. At Elaia, we are looking forward to backing the next ambitious tech and deep tech entrepreneurs from the early stages to growth.
