Commissioned: It’s not a revolutionary statement to say data has value – how many times have we heard it’s the new oil? But what might not be readily apparent is how much. Up until this point, the vastness of data – and lack of tools to efficiently parse it – has made it almost impossible to analyze at scale. But that’s all changing thanks to generative AI (GenAI), as well as broader AI breakthroughs on the horizon. The bigger question is, how do we measure it? How do we begin to understand it? What examples can we point to that show exactly how massive the value of data has become?
The good news is the internet, possibly the largest source of data value on earth, provides a great case study in data, its usage, and its exponentially growing value. Let’s take a look at why the internet might be the most visible and timely battleground to analyze and learn just how critical data will be in this new AI era.
The ebb and flow of the internet
We started with links and search engines. Early portals gave way to Google search dominance – a page with a single search bar that took you where you wanted to go. This soon gave way to social media platforms that brought your attention back to centralized hubs where you could scroll through curated lists of content that were brought to your feed. In an effort to create freedom from platforms and their algorithms and democratize access and ownership of digital content, we saw the rise of Web3 and the promise of the internet moving back towards decentralization. However, the Web3 movement relied on advanced technical skills and a trust in a broad community that had some high-profile challenges in recent years.
Could GenAI be the next iteration of the Web? A Web4?
GenAI has burst onto the scene and become a sensation. It’s changing how developers and ops folks work, organizations everywhere are rushing to explore how it can impact their business, and its potential is being touted in many areas. Let’s unpack how it changes the game for the internet.
GenAI offers ease of use. It’s why it’s captured the attention of users and organizations everywhere, and why it offers the potential to reshape the internet. Because GenAI suggests a return to a centralized web consumption experience, it’s not surprising that search engines are jumping in on this trend and trying to bolster their positions. Simply put, it’s doing something Web3 never did: making it much easier for users to get to what they want.
Here’s how it works: instead of going to a recommended destination where they must identify the answer themselves, the user simply asks the GenAI application a question and the answers are brought to them in a conversational way. This is a potentially earth-shattering level of change to the internet, because the ability to interrogate the data and land on exactly what you were looking for has long been elusive. This suggests the new Web4 era will be centered around personalization and the ability to interact with data much as we interact with other people. It is also drawing the battle lines for what the future of the internet could look like, because GenAI bypasses the existing players and their platforms and surfaces results in conversational responses.
GenAI training and internet scraping
When building and training a foundation model for GenAI, data quantity and quality are critical to achieving the best outcomes. As such, the internet was one of the earliest places that many of these AI models looked to. Where else could you get so much data from so many active participants? Plus, it was data that was freely available and, in many cases, the users of these platforms were actively giving their data away. But it gets murky, because who owns that data and whether intellectual property protections apply is still neither well understood nor legally tested.
Add to that the fact that many of the early players in this space were startups with a “move fast and break stuff” mentality, and AI on the internet is very much the wild west. We’ve seen early battle lines drawn by companies like Getty Images, which sees GenAI images that pull from its archives as derivative works and argues these tools are more like Napster than something novel and new. But perhaps the area sparking the biggest seismic shifts is social media, where platforms are typically granted some level of access and assigned rights via their terms of service and offer a free platform in return. The challenge here is that many had built extensions to their platforms using APIs or allowed users unfettered access, and now they see GenAI as a massive threat to their efforts and valuation.
Social media strikes back
Recently, we’ve seen many social media companies go on the offensive, restricting GenAI tools’ access to their content and leaving their users caught in the crossfire. It’s not hard to see why this is occurring; these AIs can collect and consume a sea of content and provide a contextual search of that content that is completely personalized.
What a value proposition for an audience. With things like GitHub Copilot, a simple GenAI prompt can surface documentation or code snippets in seconds. Stack Overflow, Reddit and Twitter have all moved to start charging for API access to their sites’ content. It makes sense; if a perceived competitor is gathering and using all your data to gain a competitive advantage, why wouldn’t you seek compensation or limit access? Reddit has also recently moved to restrict third-party apps, and interestingly, that has sometimes put it at odds with its community moderators.
Then there’s Twitter. Over the 4th of July weekend, the social media platform began to temporarily limit user access to its content. We’re also beginning to see some of this battle spill over into other web properties, such as news outlets, where GenAI-powered search can defeat paywalls. In a world where content and data become products themselves, we will continue to see this tug of war. Decisions are being made that are shaking the internet to its very foundation, and the sole reason for any of this is just how precious this data is.
What the internet can tell you about your own data
When thinking about your enterprise or business data versus what might commonly be found on the internet, consider this: your data’s probably way more valuable. We’re seeing tech behemoths fight tooth and nail to protect their data and IP, even when it’s user-generated and potentially available on many other platforms and in many other forms, or full of low-quality data like spam and bot networks. With GenAI, the proportion of data that holds value has increased exponentially. This means that organizations may have to re-evaluate their existing notions about data, because GenAI has changed the equation.
If social media companies work this hard to limit access and bolster their competitive position with data, leaders embracing AI must take similar steps. Lean into the world of ambiguity: the use cases may not yet be obvious, but the answer will lie in the data, and retention policies must change as a result. Consider where your data sits and how best to get AI to it, because data gravity will still play a role. Understand how it’s used. Data and intellectual property leakage must be avoided; if your data ends up training someone else’s foundation model, it could reduce your differentiation.
Ultimately, what we are watching now in social media will play out repeatedly in other spaces; data will be the great differentiator. In this space, that means taking a hard look at your AI solutions and ensuring you are limiting areas of exposure. We’re rapidly approaching a world where every organization is data-driven and using AI. This means it will be more important than ever to protect your most valuable asset – your data – and never outsource your core competencies.
Learn how GenAI and automation lower the bar for data center management in our latest podcast, The Great Equalizer: GenAI and AI Transforming the Data Center, and learn how to bring generative AI to your organization.
Brought to you by Dell Technologies.