Cohesity Gaia and vector embedding the backup data estate

Listening to Cohesity’s Gaia guy, Greg Statton, talking to an IT Press Tour audience about having a conversation with your data, and CEO Sanjay Poonen saying that Gaia, with RAG, will provide insight into your private data, prompted a thought: how will that data be made ready?

Gaia (Generative AI Application) provides data in vector embedding form to large language models (LLMs) so they can use it to generate answers to questions and requests. This being Cohesity, the source data is backup (secondary) data as well as primary data. An enterprise could have a petabyte or more of backup data in a Cohesity repository.

Greg Statton, Office of the CTO – Data and AI at Cohesity.

Statton told us companies adopting Gaia will need to vectorize their data – all or part of it. Think of this as a quasi-ETL (Extract, Transform, Load) process: raw structured or unstructured data is fed to an embedding process that calculates vector embeddings and stores them in a Gaia database.
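In code terms, that quasi-ETL loop might look something like the sketch below – a minimal illustration, not Gaia’s actual pipeline. The OpenAI embedding call stands in for whatever model is used in practice, and the dict-backed store stands in for a real vector database.

```python
# Hypothetical quasi-ETL sketch: extract raw text, transform it into
# vector embeddings, load them into a (stand-in) vector store.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract(documents):
    """Extract: yield raw text pulled from backups, mailboxes, shares."""
    for doc in documents:
        yield doc["id"], doc["text"]

def transform(text):
    """Transform: turn raw text into a vector embedding."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # illustrative model choice
        input=text,
    )
    return response.data[0].embedding

def load(store, doc_id, vector):
    """Load: store the embedding keyed by document ID."""
    store[doc_id] = vector  # stand-in for a real vector database insert

vector_store = {}
docs = [{"id": "mail-001", "text": "Q3 sales summary attached..."}]
for doc_id, text in extract(docs):
    load(vector_store, doc_id, transform(text))
```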

This is all very well, but how does the raw data get transformed into a format usable by LLMs? It could take a long time to vectorize a whole data set. This is a non-trivial process.

Suppose a customer has Veritas backup data on tape. If they want to use Gaia and AI bots to query that tape-based data, they have to vectorize it first. That means restoring the backups and identifying words, sentences, emails, spreadsheets, VMs, and so on. It will take a while – maybe weeks or months – but it will need to be done if a customer wants to query the entirety of their information.

How is it done? The raw data is fed to a pre-trained model – such as BERT, Word2Vec, or ELMo – which vectorizes text data. There are other pre-trained models for audio and image data.
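As a rough illustration of that step, here is text vectorization with a BERT-family model via the sentence-transformers library; the specific model is an arbitrary choice, not one the article ties to Gaia.

```python
# Vectorizing text with a pre-trained BERT-family model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-derived model

sentences = [
    "Restored email: please approve the Q3 budget.",
    "Spreadsheet row: region=EMEA, revenue=1.2M",
]
embeddings = model.encode(sentences)  # one fixed-length vector per input
print(embeddings.shape)               # (2, 384) for this model
```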

A DataStax document explains: “Managing and processing high-dimensional vector spaces can be computationally intensive. This can pose challenges in terms of both computational resources and processing time, particularly for very large datasets.”

It gets worse, as we are assuming that restored backup and other raw data can be fed straight into a vector embedding model. Actually the data may need preparing first – removing noise, normalizing text, resizing images, converting formats, etc. – all depending on the type of data you are working with.
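For text, that preparation might be as basic as the clean-up sketch below – a hedged example only; real pipelines would add format conversion, de-duplication, language detection, and more.

```python
# Basic text normalization ahead of embedding (illustrative only).
import html
import re
import unicodedata

def normalize_text(raw: str) -> str:
    text = html.unescape(raw)                   # &amp; -> &, &nbsp; -> space
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML tags
    text = unicodedata.normalize("NFKC", text)  # unify Unicode forms
    text = re.sub(r"\s+", " ", text)            # collapse whitespace/newlines
    return text.strip().lower()

print(normalize_text("  <b>Quarterly&nbsp;Report</b>\n2024  "))
# -> "quarterly report 2024"
```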

Barnacle.ai founder John-David Wuarin has worked out a tentative timescale for transforming a million documents into vector embeddings using OpenAI. He starts with a price: $0.0001 per 1,000 tokens, where a token is formed from part of a word, a whole word, or a group of words, and documents are split into chunks of tokens.

For one document of 44 chunks of 1,000 tokens each, that’s 44 x $0.0001 = $0.0044. For 1,000,000 documents averaging 44 chunks of 1,000 tokens each: 1,000,000 x 44 x $0.0001 = $4,400. There are rate limits for a paid user: 3,500 requests per minute and 350,000 tokens per minute. He reckons embedding all the tokens would take 1,000,000 x 44 x 1,000 ÷ 350,000 = 125,714 minutes – or roughly 87 days. It is a non-trivial process.
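The arithmetic is easy to check:

```python
# Reproducing Wuarin's back-of-the-envelope numbers.
docs = 1_000_000                 # documents
chunks_per_doc = 44              # 1,000-token chunks per document
tokens_per_chunk = 1_000
price_per_1k_tokens = 0.0001     # USD
tokens_per_minute_cap = 350_000  # paid-tier rate limit

total_tokens = docs * chunks_per_doc * tokens_per_chunk
cost = total_tokens / 1_000 * price_per_1k_tokens
minutes = total_tokens / tokens_per_minute_cap

print(f"${cost:,.0f}")                  # $4,400
print(f"{minutes:,.0f} minutes")        # 125,714 minutes
print(f"{minutes / 60 / 24:.0f} days")  # 87 days
```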

Statton believes that enterprises wanting to make use of Gaia will probably run a vector embedding ETL process on subsets of their data and proceed in stages, rather than attempting a big bang approach. He noted: “We vectorize small logical subsets of data in the background.”
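A sketch of what that staged approach could look like – the subset names and the embed_batch() body are hypothetical stand-ins, not Cohesity’s implementation:

```python
# Staged, subset-at-a-time vectorization running in the background.
from concurrent.futures import ThreadPoolExecutor

def embed_batch(subset_name, texts):
    """Placeholder: vectorize one logical subset (a mailbox, a share...)."""
    print(f"embedding {len(texts)} items from {subset_name}")
    # ... call the embedding model and write to the vector store here ...

subsets = {
    "finance-mailboxes": ["mail 1", "mail 2"],
    "hr-file-share": ["doc 1"],
}

# Small logical subsets are queued and processed stage by stage,
# rather than vectorizing the whole estate in one big-bang pass.
with ThreadPoolExecutor(max_workers=2) as pool:
    for name, texts in subsets.items():
        pool.submit(embed_batch, name, texts)
```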

Customers will use Gaia with Azure OpenAI from the get-go to generate answers to users’ input. “But,” he said, “we can use other LLMs. Customers can bring their own language model” – BYOLLM, as it were.
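The BYOLLM idea is simple to express in code: keep the retrieval side fixed and make the generator a swappable component. The interfaces below are hypothetical; Gaia’s actual mechanism isn’t public.

```python
# Retrieval-augmented generation with a pluggable ("bring your own") LLM.
from typing import Callable, List

def answer(question: str,
           retrieve: Callable[[str], List[str]],
           llm: Callable[[str], str]) -> str:
    """Retrieve relevant chunks, then let whichever LLM generate."""
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return llm(prompt)

# Swap in Azure OpenAI, a local model, or anything with the same shape.
stub_llm = lambda prompt: "stub answer based on: " + prompt[:40]
stub_retrieve = lambda q: ["backup snapshot metadata...", "retention policy..."]
print(answer("What was backed up last night?", stub_retrieve, stub_llm))
```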

Statton also reckons the models don’t matter – they’re useless without data. The models are effectively commodity items. The suppliers who own the keys to the data will win – not the model owners. But the data has to be vectorized, and incoming fresh data will need to be vectorized as well. It will be a continuing process once the initial document/backup stores have been processed.
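Keeping the embeddings current could be as simple as hashing each incoming item and embedding only what the store hasn’t seen – again a hypothetical sketch, with embed() standing in for a real model call.

```python
# Incremental vectorization of fresh data, deduplicated by content hash.
import hashlib

seen = {}  # content hash -> embedding (stand-in for a vector database)

def embed(text):
    return [0.0]  # placeholder for a real embedding call

def upsert(text):
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in seen:  # skip anything already vectorized
        seen[key] = embed(text)
    return key

upsert("nightly backup manifest 2024-03-01")
upsert("nightly backup manifest 2024-03-01")  # duplicate: not re-embedded
print(len(seen))  # 1
```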

Gaia promises to provide fast and easy conversational access to vectorized data, but getting the whole initial data set vectorized will likely be neither fast nor easy. Nor particularly cheap.