MinIO blends object storage and table data for GenAI workloads

MinIO is becoming a multi-modal data store for AI, providing tools for models and agents to access and use that data. This became evident in a lengthy interview with MinIO co-founder and co-CEO AB Periasamy.

In the first part of this interview, we found out that MinIO's work on providing fast access to object data for AI led it to realize that it was, at heart, a key-value database company, and that key-value stores could hold both unstructured, object-style data and structured, tabular, Iceberg-style data.

From the AI point of view, it makes no sense to vectorize Iceberg-style data items because, unlike words or images, they don't stand alone with dimensional aspects that you can vectorize. There needs to be some intervening logic or abstraction between the tables and the GenAI LLMs and agents, one that bridges their unstructured-data vector focus and the searching of structured data. This is what we are going to look at in part two.

Blocks & Files: How can MinIO, which is an unstructured data company holding object storage, help with bringing structured data to GenAI models? Will the structured data become an object? Are you going to assimilate it somehow?

AB Periasamy

AB Periasamy: Early on in the object store, we saw that structured data is a layer above unstructured data. Iceberg, for example, is nothing but a collection of Parquet-like objects. Basically, if you have a 10 PB table, you're not going to store it as a single 10 PB object. It's going to be a collection of smaller Parquet objects. Every 10 million rows is sorted and segmented into one Parquet object, and then you have a huge collection of Parquet objects.
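A rough sketch of the segmentation Periasamy describes, in plain Python with no Parquet library: a large table is written as many fixed-size row segments rather than one giant object. The 10-million-row cut-off is the figure he mentions; it is scaled down here so the example runs instantly, and the object names are illustrative.

```python
# Hypothetical stand-in for the 10-million-row cut-off mentioned above,
# scaled down so the example runs instantly.
ROWS_PER_OBJECT = 10  # stand-in for 10_000_000

def segment(rows, rows_per_object=ROWS_PER_OBJECT):
    """Yield (object_name, row_chunk) pairs, one per would-be Parquet object."""
    for i in range(0, len(rows), rows_per_object):
        yield f"part-{i // rows_per_object:05d}.parquet", rows[i:i + rows_per_object]

table = list(range(35))        # 35 rows -> 4 objects of at most 10 rows each
objects = dict(segment(table))
print(len(objects))            # 4
print(list(objects)[0])        # part-00000.parquet
```

A 10 PB table handled this way never exists as one object; it is only ever the sum of its segment objects plus a manifest describing them.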

We are continuously dumping data captured from other databases as change data capture, or machine-generated telemetry data. All of it is coming towards this table format, but then the table data is written to the object store as Parquet objects – just objects – with a manifest file (a table of contents) and a metadata description. All of it is simply a layer above the object store.

Very early on, I thought that structured data would go to databases and unstructured data would go to the object store; they would sit side by side. Then we saw the evolution: structured data came on top of the object store because enterprises were simply dealing with the scale of structured data. Databases rewrote their backends to go to the object store. That led [us to question] why there should be different proprietary data formats, one specific to the object store and one specific to databases. That led to the open table format.

The next big evolution that we see is that, if tables are a big part of object storage, tables should not be second-class citizens; tables should be first-class citizens.

What we are working on now is: there are objects and there are tables. Structured data goes to tables. Unstructured data goes to objects, and both become first-class citizens inside the object store.

We already showed with promptObject [see Bootnote below] that if you have a sales receipt or any other kind of unstructured data, you can query that unstructured data asset as if it were structured data. But if all you are storing is structured data, there's no need for emulation. Instead, the models need to discover it – say you have a hundred petabytes' worth of table data.

This is unlike a database query, which only needs to touch a subset of the data. GenAI needs to understand the entire data set. How does the GenAI even discover that the data is in Iceberg, that there is a catalog, an API that discovers all kinds of tables, their metadata, their ranges?

Blocks & Files: My understanding is that for unstructured data, an agent or a large language model needs it to be vectorized so that it can then search in a semantic space and generate its responses. With structured data, you don’t need it to be vectorized. You can search it by existing means. So does that mean that MinIO will, in the unstructured data sense, be doing things to support vectorization? Possibly to store vectors and feed them to the large language models?

AB Periasamy: If it is unstructured data, you vectorize it, but if it is structured data, how do you even vectorize it? … In structured data, can I vectorize every row? It would not make any sense at all. The right way to deal with this is a linking layer: the AI first needs to discover the tables. It's also very much context-specific. So the goal should be to load the structured data, as data frames, into memory. The good part about Iceberg is that there are plenty of libraries in Rust and Python; there are several frameworks to load this data.

There's a linking layer that knows how to load on demand what you are looking for. It's like the KV cache in the GPU space. You cannot load all of your tables into memory. The first thing you go through is a discovery phase of what, say, a business analyst is asking – a very high-level question. … The business user knows what they need to discover from a business point of view. Translating that, the first step is to load and query the tables, and then, once you know what subset of the data is interesting for this job, that data alone is loaded from the table into GPU memory.
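The on-demand loading Periasamy describes can be sketched with per-file min/max column statistics, the kind Iceberg manifests carry for each data file. The file names, column name, and value ranges below are illustrative, not from any real table:

```python
# Each entry mimics per-file column statistics from an Iceberg manifest.
# Only files whose [min, max] range can satisfy the predicate get loaded.
manifest = [
    {"file": "part-00000.parquet", "min_ts": 100, "max_ts": 199},
    {"file": "part-00001.parquet", "min_ts": 200, "max_ts": 299},
    {"file": "part-00002.parquet", "min_ts": 300, "max_ts": 399},
]

def files_for_range(manifest, lo, hi):
    """Return only the data files whose stats overlap the query range."""
    return [e["file"] for e in manifest if e["max_ts"] >= lo and e["min_ts"] <= hi]

# A query over ts in [250, 320] touches two of the three files; the rest
# of the table never leaves the object store.
print(files_for_range(manifest, 250, 320))  # ['part-00001.parquet', 'part-00002.parquet']
```

This metadata-only pruning is what lets "that data alone" reach GPU memory while the other petabytes stay on disk.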

What we find is that vectorization is not helping us here. Instead, we use the metadata information, and that creates a structure that we feed to the AI, using the AI's ability to generate code. So the agents are actually getting created on the fly. The link here is that GenAI writes code, and that code understands the structure of the tabular data.
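A toy sketch of that flow – with the LLM call stubbed out by a canned response, so the function name, schema, and question are all hypothetical – shows the pattern: the model emits analysis code, and the generated code, not the model, does the arithmetic over the rows:

```python
# Hypothetical stand-in for a model call: given the table's schema and a
# question, a real LLM would return analysis code; here the response is canned.
def llm_generate_code(schema, question):
    return "result = sum(row['amount'] for row in rows) / len(rows)"

rows = [{"amount": 10.0}, {"amount": 30.0}, {"amount": 20.0}]
schema = {"amount": "double"}

code = llm_generate_code(schema, "What was the average check size?")
scope = {"rows": rows}
exec(code, scope)        # run the generated code against the table rows
print(scope["result"])   # 20.0
```

The point of the pattern is that identical-looking rows never need to be "understood" in a semantic space; they are reduced by ordinary code the model wrote for this one question.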

That’s how we brought both together as compared to the prompt engineering style integration.

Blocks & Files: OK, so you would not be using knowledge graphs, which have been suggested as a way of bringing relational database information to AI?

AB Periasamy: When you come to structured data, the knowledge graph is actually, in some sense, small; it lives within the metadata information. But once you get down to a particular column, for example, how do you even build a knowledge graph? The data is actually quite linear – segmented, sorted, deduplicated. This structured data world is quite different. It's not that there is no knowledge graph; there is always a knowledge graph in terms of understanding the overall meta-structure of the data. But the structured data itself – we need to deal with it very differently.

That's where we see that the path to structured data querying is not actually querying. It's about understanding structured data by using AI to write code that understands it, which works out better than AI trying to comprehend it at a human language level.

If it’s art, poetry or anything, there is a conversation of one line to the next line to the next line. There’s source code. There’s logic to it. Look at a table, though. Just take ten rows. They all look identical.

Structured data is numbers. To understand that you still need mathematical algorithms, and AI can write code literally like a data scientist. It can spin up agents that are data scientist-equivalent and multiple such agents can understand the structure of the numbers. It is just a mathematical operation. Once AI gets that number result, then it combines that with the unstructured knowledge.

The difference here, compared with any other data analyst or data engineer, is that GenAI becomes your data team. You would normally have a visualization expert and data engineers – a whole team that understands how to deal with structured data. The GenAI part essentially replaces that data team. That's what translates to the most business value for the customer because, from a customer's point of view, all I need is a business analyst. Business analysts can ask these agents, and the agents can go and figure out how to get all this done.

Blocks & Files: Will you have some kind of separate metadata store for this structured data?

AB Periasamy: We've always said no metadata database, because that's at the heart of all scalability challenges. If I bring in a metadata database for structured data, the scale is even larger, and if I blow up the metadata database, all your structured data is gone. There's no way to reconstruct it.

The Iceberg format was designed specifically to address this kind of challenge, and the last piece was the catalog itself. There is a built-in catalog inside the object store, and that catalog doesn't require a metadata database. The manifest files themselves contain the metadata. From the manifest objects, we can selectively load tables on demand into memory.
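One way to picture a catalog that needs no metadata database: the manifests are themselves objects, so catalog state can always be rebuilt by re-reading them. JSON stands in here for Iceberg's real Avro manifest format, and the bucket paths and table name are invented for the sketch:

```python
import json

# Manifests stored as ordinary objects; JSON stands in for Iceberg's real
# Avro manifest format, and the paths are illustrative.
object_store = {
    "warehouse/sales/metadata/manifest-0.json": json.dumps(
        {"table": "sales", "files": ["part-00000.parquet", "part-00001.parquet"]}
    ),
}

def rebuild_catalog(store):
    """Reconstruct the table catalog purely from manifest objects."""
    catalog = {}
    for key, body in store.items():
        if "/metadata/manifest-" in key:
            m = json.loads(body)
            catalog.setdefault(m["table"], []).extend(m["files"])
    return catalog

# Losing in-memory catalog state loses nothing: re-read the manifests and go.
print(rebuild_catalog(object_store))
```

Because the durable truth is the manifest objects, a crashed catalog is a cache miss, not data loss – which is the claim Periasamy makes about cluster memory in the next answer.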

Blocks & Files: So you have separate metadata objects. They're objects – a separate class of objects.

AB Periasamy: I have plenty of memory at cluster scale, more than any other database can use, and nothing beats memory. So we can use the distributed memory to hold all the table metadata information. If you crash the cluster, you haven't lost anything, because it is simply reloaded from the persistent objects sitting in the object store.

Blocks & Files: Forgive me if I should know this already, but I imagine you’d be providing some kind of KV cache offload.

AB Periasamy: All of the inference engines already have a KV cache built in. The offload allows the KV cache to store, retrieve, and rehydrate cache entries, and to handle very large amounts of memory; the engines need the fastest possible access to a key-value store. In some sense it's like virtual memory paging in older times. We already support the KV cache offload interface, and we are now working with various inference engines and customers to adopt it.
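The offload pattern he describes can be sketched as a two-tier key-value store: hot entries in memory, everything persisted to the object store, with misses rehydrated on demand. The class and method names below are illustrative of the pattern, not MinIO's actual offload interface:

```python
class KVCacheOffload:
    """Two-tier KV cache: in-memory dict backed by a (mock) object store."""

    def __init__(self, backing_store):
        self.memory = {}                 # hot tier, lost on crash or eviction
        self.backing = backing_store     # durable tier (the object store)

    def put(self, key, value):
        self.memory[key] = value
        self.backing[key] = value        # offload: persist alongside memory

    def get(self, key):
        if key not in self.memory:       # miss: rehydrate from the object store
            self.memory[key] = self.backing[key]
        return self.memory[key]

store = {}
cache = KVCacheOffload(store)
cache.put("layer0/seq42", b"kv-tensor-bytes")
cache.memory.clear()                     # simulate a crash / eviction
print(cache.get("layer0/seq42"))         # b'kv-tensor-bytes'
```

The virtual-memory analogy holds: GPU memory plays the role of RAM pages, and the object store plays the swap device the cache pages against.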

The interesting part about this is that the vast majority of enterprises are still behind. But those at the very core of this GenAI world – particularly the GenAI clouds, the GPU cloud vendors – need to solve this problem. They're the ones addressing enterprise challenges today. We are working with these players on KV cache offload.

Comment

We should realize that MinIO is not just an object storage supplier. It is a source-data supplier for GenAI LLMs and agents, supporting vectors for unstructured data and SQL-type queries for structured data. Its LLMs write the query code needed for each question that a business analyst types in.

In this sense, MinIO bridges the object storage, vector database, and SQL database access worlds. It is, in this regard, somewhat similar to VAST Data in being a multi-modal data store for AI and providing tools to access and use that data.

Bootnote

MinIO’s promptObject API is an S3 API extension that “lets users or applications talk to unstructured objects as if they were talking to an LLM. That means you can ask an object to describe itself, to find similarities with other objects and to find differences with other objects.”

“The promptObject API is effectively transparent to the user/application. No prior knowledge of RAG models, vector databases or other AI concepts is required. The promptObject API works out of the box with multi-agentic architectures where orchestration is built in to work with small-scale, domain AI specific models.”

For example, there could be an image of a restaurant receipt in the object store, and that receipt has the word "Guest" on it followed by the number 4. The promptObject API can be used to "ask the object how many people came to dinner."

“The user can ask almost any question about the receipt. What was the average check size, what city is it in, what is the image at the top, what was the most expensive dish? 

“MinIO runs a multi-modal LLM on the backend and takes care of everything. It is totally transparent to the IT user or application developer (but naturally open to the data science team for inspection). This does require GPUs, but the team could get started with just one.”