AI vectorization signals end of the unstructured data era

Analysis: A VAST Data blog says real-time GenAI-powered user query response systems need vector embeddings and knowledge graphs created as data is ingested, not in batch mode after the data arrives.

VAST co-founder Jeff Denworth writes in a post titled “The End of the Unstructured Data Era” about retrieval-augmented generation (RAG): “AI agents are now being combined to execute complex AI workflows and enhance each other. If a business is beginning to run at the speed of AI, it’s unthinkable that AI engines should wait minutes or days for up-to-date business data.”

Jeff Denworth

He says early “solutions for chatbots … front-ended small datasets such as PDF document stores,” but now we “need systems that can store and index trillions of vector embeddings and be able to search on them in real time (in order to preserve a quality user experience).”

A vector embedding is an array of calculated numbers (a vector) that locates an unstructured data item along many learned dimensions, such as color, position, shape component, aural frequency, and more. A document, file, data object, image, video, or sound recording can be analyzed by a vectorization engine, which generates embeddings with hundreds, thousands, or more dimensions that characterize the item. These vector embeddings are indexed and stored in a vector database. When a user makes a request to a GenAI chatbot, that request is turned into vectors and the vector database is searched for similar vectors inside a so-called semantic space.
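The lookup described above amounts to nearest-neighbor search over embedding vectors. A minimal sketch, using invented four-dimensional toy vectors rather than a real embedding model (production systems use hundreds or thousands of dimensions and approximate-nearest-neighbor indexes):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for a vector database.
vector_db = {
    "mona_lisa.jpg":    [0.9, 0.1, 0.3, 0.7],
    "nexus_9200.jpg":   [0.1, 0.8, 0.6, 0.2],
    "starry_night.jpg": [0.7, 0.3, 0.5, 0.5],
}

def nearest(query_vector):
    # Semantic search: return the stored item whose embedding is closest
    # to the query in the semantic space.
    return max(vector_db,
               key=lambda k: cosine_similarity(vector_db[k], query_vector))

query = [0.85, 0.15, 0.35, 0.65]  # embedding of the user's uploaded image
print(nearest(query))             # prints mona_lisa.jpg
```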

The user could supply an image and ask “What is the name of this image?” The image and question are vectorized and the chatbot semantically searches its general training dataset for image vectors that are closest to the user-supplied image, and responds with that painting’s name: “It is the Mona Lisa painted by Leonardo da Vinci,” or, more useful to an enterprise: “It is a Cisco Nexus 9200 switch.”

To improve response accuracy, the chatbot can be given access to a customer organization’s own data, and the response it generates is augmented with information retrieved from that data, hence the term retrieval-augmented generation.
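In outline, RAG prepends retrieved enterprise text to the user’s question before the model answers. The retriever below is a hypothetical stand-in that ranks documents by keyword overlap; a real system would run a vector-database similarity search as described above:

```python
def retrieve(question, document_store, top_k=1):
    # Stand-in retriever: rank documents by word overlap with the question.
    # A production pipeline would use embedding similarity instead.
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return sorted(document_store, key=overlap, reverse=True)[:top_k]

def build_rag_prompt(question, document_store):
    # Augment the prompt with retrieved company data before generation.
    context = "\n".join(retrieve(question, document_store))
    return f"Context:\n{context}\n\nQuestion: {question}"

docs = [
    "The Nexus 9200 switch supports 40/100GbE ports.",   # invented example data
    "Our refund policy allows returns within 30 days.",
]
print(build_rag_prompt("What ports does the Nexus 9200 switch support?", docs))
```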

A knowledge graph is generated from structured or block data and stores and models relationships (events, objects, concepts, or situations) between so-called head and tail entities, with a “triple” referring to a head entity + relationship + tail entity. Such triples can be linked, and the relationships are the semantics. Crudely speaking, they describe how pairs of data items are linked in a hierarchy. Chatbots at the moment do not use knowledge graphs, but suppliers like Illumex are working on what we could call knowledge graph-augmented retrieval.
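The head + relationship + tail structure can be sketched as a list of triples queried by pattern matching; the entities and relations here are invented illustrations:

```python
# Each triple is (head entity, relationship, tail entity).
triples = [
    ("Mona Lisa", "painted_by", "Leonardo da Vinci"),
    ("Leonardo da Vinci", "born_in", "Vinci"),
    ("Nexus 9200", "made_by", "Cisco"),
]

def query(head=None, relation=None, tail=None):
    # Return every triple matching the pattern; None acts as a wildcard.
    return [t for t in triples
            if (head is None or t[0] == head)
            and (relation is None or t[1] == relation)
            and (tail is None or t[2] == tail)]

# Linked triples: who painted the Mona Lisa, and where was that painter born?
painter = query(head="Mona Lisa", relation="painted_by")[0][2]
print(query(head=painter, relation="born_in"))
# prints [('Leonardo da Vinci', 'born_in', 'Vinci')]
```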

Denworth’s company has announced a real-time data ingestion, vectorization, storage, and response AI-focused InsightEngine system, and his blog extols its virtues and explains the need for its development.

He writes: “We’re watching the exponential improvements in embedding models and seeing a time in the not-too-distant future where these tools will be of a caliber that they can be used to index the entirety of an enterprise knowledge base and even help to curate enterprise data. At that point, hyper-scale vector stores and search tools are table stakes. The trillion-way vector problem is already a reality for large AI builders like OpenAI and Perplexity that have gone and indexed the internet.”

As well as vectorizing existing data sets, companies will “need to be able to create, store and index embeddings in real time.”

“I think of vectors and knowledge graphs as just higher forms of file system metadata,” Denworth writes. “Why wouldn’t we want this to run natively from within the file system if it was possible?”

Existing file and object systems “force IT teams to build cumbersome retrieval pipelines and wrestle with the complexity of stale data, stale permissions and a lot of integration and glue code headaches … The idea of a standalone file system is fading as new business priorities need more from data infrastructure.”

Let’s think about this from a business aspect. A business has data arriving or being generated in multiple places inside its IT estate: mainframe app environment, distributed ROBO systems, datacenter x86 server systems, top tier public cloud apps, SaaS apps, security systems, data protection systems, employee workstations, and more.

Following Denworth’s logic, all of this data will need vectorizing in real time, at the ingest/generation location point and time, and then stored in a central or linked (single namespace) database so that semantic searches can be run against it. That means that all the applications and storage systems will need to support local and instant vectorization – and knowledge graph generation as well.

There will need to be some form of vectorization standard developed, and storage capacity would need to be set aside for stored vectors. How much? Let’s take a PDF image. Assuming 512 vector dimensions and 32-bit floating point numbers per dimension, we’d need around 2 KB of capacity per embedding. Increase the dimension count and the capacity goes up; halve the floating point precision and it goes down.
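The capacity arithmetic above is simply dimensions multiplied by bytes per value:

```python
def embedding_bytes(dimensions, bytes_per_value):
    # Raw storage for one embedding vector (excluding index overhead).
    return dimensions * bytes_per_value

# 512 dimensions at 32-bit (4-byte) floats, per the example above.
print(embedding_bytes(512, 4))   # prints 2048, i.e. about 2 KB
# Double the dimension count and the capacity doubles:
print(embedding_bytes(1024, 4))  # prints 4096
# Halve the precision to 16-bit (2-byte) floats and it halves:
print(embedding_bytes(512, 2))   # prints 1024
```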

This means that file-handling and object systems from Dell, DDN, HPE, Hitachi Vantara, IBM, NetApp, Pure Storage, Qumulo etc. would need to have vectorization, embedding storage, and metadata added to them – if Denworth is right. Ditto all the data lake and lakehouse systems. 

Ed Zitron

AI bubble or reality

Of course, this will only be necessary if the generative AI frenzy is not a bubble and instead develops into a long-lived phenomenon with real and substantial use cases emerging. Commentators such as Ed Zitron have decided that it won’t. OpenAI and its like are doomed, according to critics, with Zitron writing: “it feels like the tides are rapidly turning, and multiple pale horses of the AI apocalypse have emerged: ‘a big, stupid magic trick’ in the form of OpenAI’s (rushed) launch of its o1 (codenamed: strawberry) model, rumored price increases for future OpenAI models (and elsewhere), layoffs at Scale AI, and leaders fleeing OpenAI. These are all signs that things are beginning to collapse.”

But consultancies like Accenture are going all-in on chatbot consultancy services. An Accenture Nvidia Business Group has been launched with 30,000 professionals receiving training globally to help clients reinvent processes and scale enterprise AI adoption with AI agents. 

Daniel Ives

Financial analysts like Wedbush also think the AI hype is real, with Daniel Ives, managing director, Equity Research, telling subscribers: “The supply chain is seeing unparalleled demand for AI chips led by the Godfather of AI Jensen [Huang] and Nvidia and ultimately leading to this tidal wave of enterprise spending as AI use cases explode across the enterprise. We believe the overall AI infrastructure market opportunity could grow 10x from today through 2027 as this next generation AI foundation gets built with our estimates a $1 trillion of AI capex spending is on the horizon the next three years.

“The cloud numbers and AI data points we are hearing from our field checks around Redmond, Amazon, and Google indicates massive enterprise AI demand is hitting its next gear as use cases explode across the enterprise landscape.”

Ben Thompson

Stratechery writer Ben Thompson is pro AI, but thinks it will take years, writing: “Executives, however, want the benefit of AI now, and I think that benefit will, like the first wave of computing, come from replacing humans, not making them more efficient. And that, by extension, will mean top-down years-long initiatives that are justified by the massive business results that will follow.”

Who do we believe? Zitron or the likes of VAST Data, Accenture, Wedbush, and Thompson? Show me enterprises saving or making millions of dollars from GenAI use cases with cross-industry applicability and the bubble theory will start receding. Until that happens, doubters like Zitron will have an audience.