AI data pipelines could use a hand from our features, says Komprise

Interview. AI training and inferencing need access to datasets. The dataset contents will need to be turned into vector embeddings for GenAI large language models (LLMs) to work on, with semantic search matching vectorized requests against the vectorized data to find responses.

In one sense, providing vectorized datasets to LLMs requires extracting the relevant data from its raw sources – files, spreadsheets, presentations, mail, objects, analytic data warehouses, and so on – turning it into vectors, and then loading it into a store for LLM use. That sounds like a traditional ETL (Extract, Transform and Load) process, but Krishna Subramanian, Komprise co-founder, president, and COO, claims this is not so: the Transform part is done by the AI process itself.

Komprise provides Intelligent Data Management software to analyze, move, and manage unstructured data, including the ability to define and manage data pipelines to feed AI applications such as LLMs. When LLMs are used to search and generate responses from distributed unstructured datasets, the data needs to be moved, via a data pipeline, into a single place that the LLM can access.

Filtering and selecting data from dataset sources, and moving it, is intrinsic to what the Komprise software does. Here Subramanian discusses AI data pipeline characteristics and features.

Blocks & Files: Why are data pipelines becoming more important today for IT and data science teams?

Krishna Subramanian, Komprise

Krishna Subramanian: We define data pipelines as the process of curating data from multiple sources, preparing the data for proper ingestion, and then mobilizing the data to the destination. 

Unstructured data is large, diverse, and unwieldy – yet crucial for enterprise AI. IT organizations need simpler, automated ways to deliver the right datasets to the right tools. Searching across large swaths of unstructured data is tricky because the data lacks a unifying schema. Building an unstructured data pipeline with a global file index is needed to facilitate search and curation. 
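To make the idea concrete, a global file index can be pictured as a metadata catalog built by scanning storage. Below is a minimal, hypothetical sketch in Python (not Komprise's implementation): it walks mounted file shares and records path, size, modification time, and extension in SQLite so datasets can be searched and curated without rereading content. The paths and schema are illustrative.

```python
# Hypothetical "global file index" sketch: walk one or more mount points
# and record basic file metadata in a local SQLite database so data can
# be searched and curated later without touching the contents again.
import os
import sqlite3

def build_index(db_path: str, roots: list[str]) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, ext TEXT)"
    )
    for root in roots:
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                full = os.path.join(dirpath, name)
                try:
                    st = os.stat(full)
                except OSError:
                    continue  # skip unreadable entries
                conn.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                    (full, st.st_size, st.st_mtime, os.path.splitext(name)[1]),
                )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    # illustrative mount points, not real paths
    build_index("file_index.db", ["/mnt/nas1", "/mnt/nas2"])
```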

On the same note, data pipelines are an efficient way to find sensitive data and move it into secure storage. Most organizations have PII (Personally Identifiable Information), IP, and other sensitive data inadvertently stored in places where it should not live. Data pipelines can also be configured to move data based on its profile, age, query, or tag into secondary storage. Because of the nature of unstructured data, which often lives across storage silos in the enterprise, it’s important to have a plan and a process for managing this data properly for storage efficiencies, AI, and cybersecurity data protection rules. Data pipelines are an emerging solution for these needs.

Blocks & Files: What kind of new tools do you need?

Krishna Subramanian: You’ll need various capabilities, aside from indexing, many of which are part of an unstructured data management solution. For example, metadata tagging and enrichment – which can be augmented using AI tools – allows data owners to add context and structure to unstructured data so that it can be easily discovered and segmented. 
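As a rough illustration of metadata enrichment, the sketch below adds rule-based tags to the hypothetical index from the earlier example. In practice tags might come from data owners or AI classifiers; the rule table here is invented purely for illustration.

```python
# Hypothetical follow-on to the index sketch: enrich records with tags
# derived from simple path-based rules, stored alongside the file index.
import sqlite3

RULES = {                      # illustrative tag rules only
    "benefits": ["benefits", "401k", "health_plan"],
    "finance": ["invoice", "budget"],
}

def tag_files(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS tags (path TEXT, tag TEXT)")
    paths = conn.execute("SELECT path FROM files").fetchall()
    for (path,) in paths:
        lowered = path.lower()
        for tag, keywords in RULES.items():
            if any(k in lowered for k in keywords):
                conn.execute("INSERT INTO tags VALUES (?, ?)", (path, tag))
    conn.commit()
    conn.close()
```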

Workflow management technologies automate the process of finding, classifying, and moving data to the right location for analysis, along with monitoring capabilities to ensure data is not lost or compromised through the workflow. Data cleansing and normalization tools are also needed, as are data security and governance capabilities that track data which might be used inappropriately or against corporate or regulatory rules.
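For the mobilization and monitoring piece, a minimal sketch of a safe data-movement step might copy a file to its destination and verify a checksum before removing the source. This is illustrative only, not a description of any particular product's data mover.

```python
# Sketch of a verified move: copy, checksum both sides, then delete the
# source only if the copy is intact, so data is not lost in the workflow.
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def move_verified(src: Path, dst_dir: Path) -> Path:
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    shutil.copy2(src, dst)                 # copy first, keep the original
    if sha256(src) != sha256(dst):         # verify before removing source
        dst.unlink()
        raise IOError(f"checksum mismatch copying {src}")
    src.unlink()
    return dst
```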

There is quite a lot to consider when setting up data pipelines; IT leaders will need to work closely with research teams, data scientists, analysts, security teams and departmental heads to create the right workflows and manage risk.

Blocks & Files: How are data pipelines evolving with GenAI and other innovations?

Krishna Subramanian: Traditionally, data pipelines have been linear, which is why ETL was the norm. These tools were designed for structured and semi-structured data sources where you extract data from different sources, transform and clean up the data, and then load it into a target data schema, data warehouse or data lake.

But GenAI is not linear; it is iterative and circular because data can be processed by different AI processes, each of which can add more context to the data. Furthermore, AI relies on unstructured data, which has limited metadata and is expensive to move and load.

Since data is being generated everywhere, data processing should also be distributed; this means data pipelines must no longer require moving all the data to a central data lake before processing. Otherwise, your costs for moving and storing massive quantities of data will be a detriment to the AI initiative. Also, many AI solutions have their own ways of performing RAG and vectorizing unstructured data.

Unlike ETL, which focuses heavily on transformation, data pipelines for unstructured data need to focus on global indexing, search, curation, and mobilization since the transformation will be done locally per AI process.

Blocks & Files: What role do real-time analytics play in optimizing data pipelines?

Krishna Subramanian: Real-time analytics is a form of data preprocessing where you can make certain decisions on the data before moving it. Data preprocessing is central in developing data pipelines for AI because it can iteratively enrich metadata before you move or analyze the data. This can ensure that you are using the precise datasets needed for a project – and nothing more. Many organizations do not have distinct budgets for AI, at least not on the IT infrastructure side of the house, and must carve funds out from other areas such as cloud and data center budgets. Therefore, IT leaders should be as surgical as possible with data preparation to avoid AI waste.
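As a concrete illustration of "surgical" data preparation, the hypothetical snippet below queries the metadata index sketched earlier to select only files with a given tag and recency before anything is moved. The tag name and age threshold are placeholders.

```python
# Hypothetical preprocessing step: pick only the files a project needs
# (by tag and age) from the metadata index before any data is mobilized.
import sqlite3
import time

def select_dataset(db_path: str, tag: str, max_age_days: int) -> list[str]:
    cutoff = time.time() - max_age_days * 86400
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT f.path FROM files f JOIN tags t ON f.path = t.path "
        "WHERE t.tag = ? AND f.mtime >= ?",
        (tag, cutoff),
    ).fetchall()
    conn.close()
    return [path for (path,) in rows]

# e.g. only benefits documents modified in the last two years
paths = select_dataset("file_index.db", "benefits", 730)
```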

Blocks & Files: How can companies leverage data pipelines to improve collaboration between data science teams, IT, and business units?

Krishna Subramanian: Data pipelines can create data workflows between these different groups. For instance, researchers who generate data can tag the data – otherwise called metadata enrichment. This adds context for data classification and helps data scientists find the right datasets. IT manages the data workflow orchestration and safe data movement to the desired location or can integrate third-party AI tools to work on datasets without moving them at all. This is a three-way collaboration on the same data facilitated by smart data workflows leveraging data pipelines.

Blocks & Files: What trends do you foresee in data pipeline architecture and how can enterprises prepare for these evolving technologies and approaches?

Krishna Subramanian: We see that data pipelines will need to evolve to address the unique requirements of unstructured data and AI. This will entail advances in data indexing, data management, data preprocessing, data mobility, and data workflow technologies to handle the scale and performance requirements of moving and processing large datasets. Data pipelines for unstructured data and AI will focus heavily on search, curation, and mobilization, with the transformation happening within the AI process itself.

Blocks & Files: Could Komprise add its own chatbot-style interface?

Krishna Subramanian: Our customers are IT people. They know how to build a query. They know how to use our UI. What they really want is connecting their corporate data to AI. Can we reduce the risk of it? Can we improve the workflow for it? That’s a higher priority than us adding chat, which is why we have prioritized our product work more around the data workflows for AI.

Blocks & Files: Rubrik is aiming to make the data stored in its backups available for generative AI training and/or inference with its Annapurna project. Rubrik is not going to supply its own vectorization facilities or its own vector database, or indeed its own large language models. It’s going to be able to say to its customers: you can select which data to feed to these large language models. Now that’s backup data. Komprise will be able to supply real-time data. Is that a point of significant difference?

Krishna Subramanian: Yes, that’s a point of significant difference … We were at a Gartner conference last month and … Gartner did a session on what do customers want from storage and data management around AI. And … a lot of people think AI needs high performance storage. You see all this news about GPU-enabled storage and storage costs going up and all of that. And that’s not actually true. Performance is important, but only for model training. And model training is 5 percent of the use cases. 

In fact, they said 50 percent of enterprises will never train a model or even engineer a prompt. You know, 95 percent of the use cases are using a model. It’s inferencing. 

And a second myth is AI is creating a lot of data, or, hey, you’re backing up data. Can you run AI on your backup data? Yes, maybe there is some value to that, but most customers really want to have all their corporate data across all their storage available to AI, and that’s why Gartner [is] saying data management is more important than storage for AI.

We build a global file index. And this is not in the future. We already do this. You point this at all your storage and … we’re actually creating a metadata base. We’re creating a database of all the metadata of all the files and objects that we look at. And this is not backup data. It’s all your data. It’s your actual data that’s being stored. 

Komprise graphic

So whether you back it up or not, we will have an index for you. With our global file index you can search across all the data. You can say, I only want to find benefits documents because I’m writing a benefits chat bot. And anytime new benefits documents show up anywhere, find those and feed those to this chat bot agent and Komprise will automatically run that workflow. 

And every time new documents show up in Spain or in California or wherever, it would automatically feed that to that AI and it would have an audit trail. It will show what was sent. It will show which department asked for this. It will keep all of that so that, for your data governance for AI, you have a systematic way to enable that.
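A workflow of that shape could be approximated as a recurring job that looks for newly modified, matching documents and hands each one to the target AI service while appending an audit record. The sketch below reuses the hypothetical index from earlier; feed_to_chatbot is a placeholder for whatever ingestion mechanism the AI side actually exposes, and the audit fields are illustrative.

```python
# Hypothetical recurring workflow: find benefits documents modified since
# the last run, hand each to the chatbot's ingestion hook, and keep a
# simple JSON-lines audit trail of what was sent, for whom, and when.
import json
import sqlite3
import time

def run_workflow(db_path: str, state_path: str, audit_path: str,
                 department: str, feed_to_chatbot) -> None:
    try:
        with open(state_path) as f:
            last_run = float(f.read())
    except FileNotFoundError:
        last_run = 0.0                      # first run: consider everything
    now = time.time()
    conn = sqlite3.connect(db_path)
    new_docs = conn.execute(
        "SELECT f.path FROM files f JOIN tags t ON f.path = t.path "
        "WHERE t.tag = 'benefits' AND f.mtime > ?", (last_run,)
    ).fetchall()
    conn.close()
    with open(audit_path, "a") as audit:
        for (path,) in new_docs:
            feed_to_chatbot(path)           # hand off to the AI service
            audit.write(json.dumps({        # what was sent, who asked, when
                "department": department, "file": path, "time": now
            }) + "\n")
    with open(state_path, "w") as f:
        f.write(str(now))
```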

Blocks & Files: Would it be fair to say that the vast majority of your customers have more than one unstructured data storage supplier and, building on that, that those suppliers cannot provide the enterprise-wide object and file estate management capability you can?

Krishna Subramanian: Yes, that is exactly correct. Sometimes people might say, “Well, no, I don’t really have many suppliers. I might only use NetApp for my file storage.” But how much do you want to bet they’re also using AWS or Azure? So you do have two suppliers then. If you’re using hybrid cloud, by definition, you have more than one supplier, yes. I agree with your statement. And that’s what our customers are doing. That’s why this global file index is very powerful, because it’s basically adding structure to unstructured data across all storage. 

And to your point, storage vendors are trying to say: look, my storage file system can index data that’s sitting on my storage.

Blocks & Files: So you provide the ability to build an index of all the primary unstructured data there is in a data estate, regulate access to it, and detect sensitive information within it, because you build metadata tables that enable you to do that. So you could then feed data to a large language model in a way that satisfies compliance and regulation needs concerning access. It would be accurate, it would be comprehensive, and you can feed it quickly to the model?

Krishna Subramanian: That’s correct. And we actually have a feature in our product called smart data workflows, where you can just build these workflows.

This is a contrived example, but you can write a chatbot in Azure using Azure OpenAI. The basic example they have is a chatbot that has read a company’s health documents, and somebody can then go and ask it a question: what’s the difference in our company between two different health plans? And then it’ll answer that based on the data it was given, right? 

So now let’s say California added some additional benefits. In the California division of this company, Komprise finds those documents, feeds them into that OpenAI chatbot, and then, when the user asks the same question, it gives a very specific answer, because the data was already fed in, right? 

But really what’s more important is what’s happening behind the scenes. Azure OpenAI has something called a knowledge base. The model was trained with certain data, but you can actually put additional data, corporate data, in a Blob container, which it indexes regularly, to augment the process. So the RAG augmentation is happening through that container. 
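For the hand-off described here, the curated documents simply need to land in the Blob container that the knowledge base indexes. A minimal sketch using the azure-storage-blob SDK might look like the following; the connection string and container name are placeholders, and configuring the Azure-side indexing is assumed to be done separately.

```python
# Sketch: drop curated documents into an Azure Blob container that the
# chatbot's knowledge base is configured to index for RAG augmentation.
from pathlib import Path
from azure.storage.blob import BlobServiceClient

def upload_docs(conn_str: str, container: str, paths: list[Path]) -> None:
    service = BlobServiceClient.from_connection_string(conn_str)
    container_client = service.get_container_client(container)
    for path in paths:
        with path.open("rb") as data:
            # overwrite so updated benefits documents replace older copies
            container_client.upload_blob(name=path.name, data=data,
                                         overwrite=True)
```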

Komprise has indexed all the storage in our global file index. So you just build a workflow saying, find anything with benefits, and Komprise automatically does that regularly; that’s how this workflow runs. And the beauty of this is you don’t have to run it by hand: anybody could be creating a new benefits document, and it will be available to your chatbot. 

Part of the problem is generative AI can become out of date because it was trained a while back. So this addresses relevancy. It addresses recency, and it also addresses data governance, because you can tell Komprise, if a source has sensitive data, don’t send it. So it can actually find sensitive data. That’s a feature we’re adding and will be announcing soon.

You can tell Komprise to look for social security numbers, or you can even tell it to look for a particular keyword or a particular regular expression, because maybe in your organization certain things are sensitive due to the way you label them. Komprise will find that inside the contents, not just the file name, and it will exclude that data if that’s what you want. So it can find personally identifiable information, the common stuff, social security numbers and so forth. But it can also find corporately sensitive information, which PII doesn’t cover.
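Content-level scanning of this kind can be pictured as a regular-expression pass over each candidate file. The sketch below is illustrative rather than Komprise's method: it drops files containing a US Social Security number pattern or a customer-defined label from the set fed onward, and the custom pattern is invented for the example.

```python
# Hypothetical content scan: exclude files whose contents match an SSN
# pattern or a customer-defined "sensitive" label before feeding them on.
import re
from pathlib import Path

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CUSTOM = re.compile(r"PROJECT[-_ ]REDWOOD", re.IGNORECASE)  # illustrative label

def exclude_sensitive(paths: list[Path]) -> list[Path]:
    safe = []
    for path in paths:
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable files are excluded rather than risked
        if SSN.search(text) or CUSTOM.search(text):
            continue  # sensitive content found: keep it out of the pipeline
        safe.append(path)
    return safe
```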

Blocks & Files: If I’m a customer that doesn’t have Komprise, that doesn’t have any file lifecycle management capability at all, then probably my backups are my single largest cross-vendor data store. So it would make sense to use them for our AI. But as soon as that customer wheels you in, the backup store is behind the times, it’s late, and you can provide more up-to-date information.

Krishna Subramanian: Yes, that’s what we feel. And, by the way, we point to any file system. So if a backup vendor exposes their backup data in a file system, we can point to that too. It doesn’t matter to us. If a backup vendor stores its data on an object storage system? Yes, it works, because we are reading the objects. So if I happen to be a customer with object-based appliances storing all my VM backup data, we say, fine, no problem, we’ll index them – because we’re reading the objects. We don’t need to be in their proprietary file system. That’s the beauty of working through standards.

Blocks & Files: I had thought that object storage backup data and object storage were kind of invisible to you.

Krishna Subramanian: Well, it’s not invisible as long as they allow the data to be read as objects. It would be invisible if they didn’t actually put the whole file as an object, if they chunked it up and it was proprietary to their file system, because then the fact that they use an object store doesn’t matter; they’re not exposing the data as objects. So if they expose data as objects or files, we can access it. 

As with NetApp, even though NetApp chunks the data, it exposes it via file and object protocols, and we read it as a file or an object. We don’t care how ONTAP stores it internally.
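The "working through standards" point boils down to reading data through a standard file or object API rather than through a vendor's internal layout. A minimal sketch against an S3-compatible endpoint using boto3 follows; the bucket and key names are placeholders.

```python
# Sketch: as long as the platform exposes data through a standard object
# API (S3 here), the contents can be read without knowing how the backend
# stores them internally.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="backup-exports", Key="finance/q3-report.pdf")
body = obj["Body"].read()      # raw bytes, regardless of internal layout
```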

Blocks & Files: How is Komprise’s business growth doing?

Krishna Subramanian: Extremely well. New business is growing, I think, over 40 percent again this year. And the overall business is also growing rapidly. Our net dollar retention continues to be north of, I think, 110 percent. Some 30 to 40 percent of our new business comes from expansions, from existing customers.