COMMISSIONED: As with any emerging technology, implementing generative AI large language models (LLMs) isn’t easy and it’s totally fair to look side-eyed at anyone who suggests otherwise.
From issues identifying use cases that maximize business value to striking the right balance between hard-charging innovation and sound governance, companies face their fair share of GenAI struggles.
Now it seems even those LLMs could use some help. If AI experts have it right, LLMs may be running out of fresh training data, which has the AI sector looking to a possible stopgap: synthetic data.
In the context of LLMs, synthetic data is artificially manufactured using the statistical properties of real-world data without using real information about companies or people and other entities. Using synthetic data helps organizations model outcomes without exposing themselves to security or privacy risks.
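To make the idea concrete, here is a minimal sketch of one common approach: capture the statistical properties (means and covariances) of a real dataset, then sample artificial records that share its structure but correspond to no actual person. The field names and values are purely illustrative, not drawn from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for real customer records (e.g., age, monthly spend).
# Purely illustrative values, not actual data.
real = np.array([
    [34, 120.0],
    [29,  85.5],
    [41, 210.0],
    [55,  60.0],
    [23,  95.0],
])

# Capture the statistical properties of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then sample new, artificial records with a similar structure.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(synthetic.shape)  # (1000, 2)
# The synthetic means track the real means, but no individual row
# corresponds to an actual person.
print(synthetic.mean(axis=0))
```

Real-world synthetic data generators are far more sophisticated (capturing correlations, categorical fields, and sequence structure), but the principle is the same: model the statistics, then sample from the model rather than sharing the raw records.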
Some experts believe that, by conjuring new data on which to train models, synthetic data can help LLMs clear the so-called data wall. To better understand the value of synthetic data, it helps to grasp the looming limitations of real-world data.
The data wall
Academics and AI luminaries alike have warned that LLMs will likely hit a limit on the amount of human-generated text available to train them – possibly as soon as 2026.
The data shortfall presents a problem because as the volume of fresh training data declines, models can struggle to generalize. This can lead to overfitting, a phenomenon in which a model fits its training data so closely that it performs poorly on new, unseen data, resulting in less coherent outputs.
Experts began publicizing the problem shortly after OpenAI kicked off the GenAI race by launching ChatGPT two years ago, and VCs powerful enough to pull the financial levers of this market have since lent their voices to the issue.
“The big models are trained by scraping the internet and pulling in all human-generated training data, all human-generated text and increasingly video and audio and everything else, and there’s just literally only so much of that,” said Marc Andreessen, co-founder of Andreessen Horowitz.
The problem is serious enough that AI companies have gone analog, hiring human domain experts such as doctors and lawyers to handwrite prompts for LLMs.
Barring any breakthroughs in model techniques or other innovations that help GenAI hurdle the coming data wall, synthetic data may be the best available option.
Big brands swear by synthetic data
Synthetic data is particularly useful for helping organizations simulate real-world scenarios – everything from predicting what merchandise customers may purchase next to modeling financial services outcomes – without the risk of exposing protected data.
Walmart, for one, synthesizes user behavior sequences for its sports and electronics categories to predict next purchases. Walmart employees vet the data throughout the process to ensure integrity between the user behavior sequence and the prediction.
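One simple way to synthesize behavior sequences of this kind is to learn transition statistics from real sequences and then sample new ones. The sketch below is a minimal, hypothetical illustration of that idea (the product names and sequences are invented; this is not Walmart's actual method):

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical "real" purchase sequences (illustrative only).
real_sequences = [
    ["running shoes", "socks", "fitness tracker"],
    ["basketball", "running shoes", "water bottle"],
    ["fitness tracker", "headphones", "water bottle"],
]

# Learn simple first-order transition statistics from the sequences...
starts = []
transitions = defaultdict(list)
for seq in real_sequences:
    starts.append(seq[0])
    for a, b in zip(seq, seq[1:]):
        transitions[a].append(b)

# ...then sample synthetic sequences that mimic those patterns
# without copying any single shopper's history verbatim.
def sample_sequence(max_len=4):
    item = random.choice(starts)
    seq = [item]
    while item in transitions and len(seq) < max_len:
        item = random.choice(transitions[item])
        seq.append(item)
    return seq

synthetic = [sample_sequence() for _ in range(5)]
for seq in synthetic:
    print(" -> ".join(seq))
```

A production system would use far richer sequence models, but the human-vetting step described above applies either way: generated sequences are only useful if a reviewer confirms they remain plausible inputs for the prediction task.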
The human-in-the-loop factor may be key to harnessing synthetic data to improve outcomes. For example, combining proprietary data owned by enterprises with reasoning from human employees can create a new class of data that corporations can use to create value.
This “hybrid human AI data approach” to creating synthetic data is something that organizations such as JPMorgan are exploring, according to Alexandr Wang, CEO of AI data company Scale AI, who noted that JPMorgan has 150 petabytes of data at its disposal, compared with the roughly 1 petabyte OpenAI has indexed for GPT-4.
In fact, OpenAI itself has used its Strawberry reasoning model to create data for its Orion LLM. You read that right – OpenAI is using its AI models to train its AI models.
The bottom line
Synthetic data has its limitations. For example, it often fails to capture the complexity and nuance – think sarcasm or turns of phrase – that make real-world data so rich. This can reduce the relevance of results, limiting the value of the scenarios synthetic data is meant to model.
As with real-world data, the algorithms used to generate synthetic data can encode or amplify existing biases, which can lead to biased outputs. Moreover, ensuring that a model trained on synthetic data performs well may require supplementary real-world data, which can make fine-tuning challenging. Inaccuracies and hallucinations remain an issue in synthetic data, too.
The challenges that come with using synthetic data call for the same sound data governance practices organizations apply to LLMs trained on real-world data. As such, many data engineers view synthetic data as complementary to real-world data rather than a replacement for it.
Even so, an existential data crisis isn’t required to capitalize on the benefits of synthetic data. And your organization needn’t operate at Walmart’s or JPMorgan’s scale to take advantage of the opportunities synthetic data has to offer.
Effectively leveraging synthetic data can be challenging for organizations that haven’t used such techniques to manage and manipulate their data.
Dell Technologies offers access to professional services, as well as a broad open ecosystem of vendors, that can help you embark on your synthetic data creation journey.
Learn more about the Dell AI Factory.
Brought to you by Dell Technologies.