Starburst CEO: In AI, it’s data access that wins

Interview: Startup Starburst develops products and services based on Trino, the open source distributed SQL query engine, to query and analyze distributed data sources. We spoke to CEO Justin Borgman about the company’s strategy.

A little history to set the scene, and it starts with Presto. This began as an internal Facebook (now Meta) project in 2012 to provide analytics for its massive Hadoop data warehouses using a distributed SQL query engine. It could query Hadoop, Cassandra, and MySQL data sources and was open sourced under the Apache license in 2013.

The four Presto creators – Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang – left in 2018 after disagreements over Facebook’s influence on Presto governance. 

They then forked the Presto code to PrestoSQL. Facebook donated Presto to the Linux Foundation in 2019, which then set up the Presto Foundation. By then, thousands of businesses and other organizations were Presto users. PrestoSQL was rebranded to Trino to sidestep potential legal action after Facebook obtained the “Presto” trademark. The forkers set up Starburst in 2019, with co-founder and CEO Justin Borgman, to supply Trino and sell Trino connectors and support. 

Borgman co-founded SQL-on-Hadoop company Hadapt in 2010. Hadapt was bought by Teradata in 2014 with Borgman becoming VP and GM of its Hadoop portfolio unit. He resigned in 2019 to join the other Starburst founders.

Eric Hwang is a distinguished engineer at Starburst. David Phillips and Dain Sundstrom both had CTO responsibilities, but they left earlier this year to co-found IceGuard, a stealth data security company. Martin Traverso is Starburst’s current CTO.

Starburst graphic

Starburst has raised $414 million over four rounds in 2019 ($22 million A-round), 2020 ($42 million B-round), 2021 ($100 million C-round), and 2022 ($250 million D-round).

It hired additional execs in early 2024 and again later that year to help it grow its business in the hybrid data cloud and AI areas.

Earlier this year, Starburst reported its highest global sales to date, including significant growth in North America and EMEA, with ARR per customer over $325,000. Adoption of Starburst Galaxy, its flagship cloud product, grew 94 percent year-over-year, and it signed its largest ever deal: a multi-year contract with a global financial institution worth eight figures a year.

Blocks and Files: Starburst is, I think, a virtual data lakehouse facility in that you get data from various sources and then feed it upstream to whoever you need to.

Justin Borgman, Starburst

Justin Borgman: Yeah, I like that way of thinking about it. We don’t call ourselves a virtual lakehouse, but it makes sense.

Blocks and Files: Databricks and Snowflake have been getting into bed with AI for some time, with the last six to nine months seeing frenetic adoption of large language models. Is Starburst doing the same sort of thing?

Justin Borgman: In a way, yes, but maybe I’ll articulate a couple of the differences. So for us, we’re not focusing on the LLM itself.

We’re basically saying customers will choose their own LLM, whether that’s OpenAI or Anthropic or whatever the case may be. But where we are playing an important role is in those agentic RAG workflows that are accessing different data sources, passing that on to the LLM to ensure accurate contextual information. 

And that’s where we think we actually have a potential advantage relative to those two players. They’re much larger than us, and I can see they’re further along. But as you pointed out, we have access to all the data in an enterprise, and I think in this era of agents and AI, it’s really whoever has the most data that wins at the end of the day. So that’s really what we provide: access to all of the data in the enterprise, not just the data in one individual lake or one individual warehouse, but all of the data.

Blocks and Files: That gives me two thoughts. One is that you must already have a vast number of connectors connecting Starburst to data sources. I imagine an important but background activity is to make sure that they’re up to date and you keep on connecting to as many data sources as possible.

Justin Borgman: That’s right.

Blocks and Files: The second one is that you are going to be, I think, providing some kind of AI pipeline: a pipeline to select data from your sources, filter it in some way – for instance, removing sensitive information – and then send it upstream, making it available. And the point at which you send it upstream and say Starburst’s work stops could be variable. For example, you select and filter some data from various sources, and there it is, sitting in, I guess, some kind of table format. But it’s raw data, effectively, and the AI models need it tokenized. They need it vectorized, which means the vectors have to be stored someplace before they are used for training or for inference. So where does Starburst activity stop?

Justin Borgman: Everything you said is right. I’m going to quantify that a little bit. So we have over 50 connectors to your earlier point. So that covers every traditional database system you can think of, every NoSQL database, basically every database you can think of. And then where we started to expand is adding large SaaS providers like Salesforce and ServiceNow and things of that nature as well. So we have access to all those things. 
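In Trino, each connected source is configured as a catalog, which is how a single query can reach several systems at once. As a minimal sketch – the catalog, schema, and table names here are hypothetical, not Starburst product syntax:

```sql
-- One query joins live MySQL order data against customer records
-- held in a data lake; Trino federates the work across both sources.
SELECT o.order_id,
       o.total,
       c.customer_name
FROM mysql_prod.shop.orders AS o
JOIN lake.crm.customers AS c
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2025-01-01';
```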

You’re also correct that we provide access control across all of those, and very fine-grained. So row level, column level, we can do data masking, and that is part of the strength of our platform: the data that you’re going to be leveraging for your AI can be managed and governed in a very fine-grained manner. So that’s role-based and attribute-based access controls.
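In Starburst these controls are applied through access-control policies rather than hand-written SQL, but purely as an illustration of what a column mask does, an equivalent view might look like this (all names hypothetical):

```sql
-- Illustrative only: expose customers with the email column masked.
-- In practice a policy attaches the mask to the base table, so
-- analysts query it directly and masking happens transparently.
CREATE VIEW lake.secure.customers_masked AS
SELECT id,
       regexp_replace(email, '(^.).*(@.*$)', '$1***$2') AS email,
       country
FROM lake.crm.customers;
```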

To address your question of where does it stop – the reason that’s such a great question is that in May we’re going to be making some announcements about going a bit further than that. I don’t want to quite scoop myself yet, but I’ll just say that in May you will see us doing pretty much the entire thing you just described. Today, I would say, we stop before the vectorization.

Blocks and Files: I could see Starburst thinking: we are not a database company, but we do access stored vaults of data, and we probably access those by getting metadata about the data sources. So when we present data upstream, we could either present the actual data itself, in which case we suck it up from all our various sources and pump it out, or we just use the metadata and send that upstream. Who does it? Do you collect the actual data and send it upstream, or does your target do that?

Justin Borgman: So we actually do both of the things you described. First of all, what we find is a lot of our customers are using an aspect of our product that we call data products, which is basically a way of creating curated datasets. And because, as you described it, we’re this sort of virtual lakehouse, those data products can actually be assembled from data that lives in multiple sources. And that data product is itself a view across those different data sources. So that’s one layer of abstraction. And in that case, no data needs to be moved necessarily. You’re just constructing this view. 

But at the end of the day, when you’re executing your RAG workflows and you’re passing data on, maybe as a prompt, to an LLM calling an LLM function, in those cases, we can be moving data. 
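The pattern Borgman describes can be sketched in Trino SQL. All names here are illustrative, not Starburst product syntax: a data product is defined once as a cross-catalog view, and a RAG workflow then retrieves fresh, governed rows from it to pass to an LLM as prompt context:

```sql
-- Define a data product: a curated view spanning a SaaS source and
-- the lake. No data is copied; the join resolves at query time.
CREATE VIEW lake.products.customer_360 AS
SELECT a.account_id,
       a.account_name,
       s.lifetime_spend
FROM salesforce.sfdc.accounts AS a
JOIN lake.finance.spend_summary AS s
  ON a.account_id = s.account_id;

-- Retrieval step of a RAG workflow: an agent pulls current,
-- access-controlled rows to hand to the LLM as context.
SELECT account_name, lifetime_spend
FROM lake.products.customer_360
WHERE account_id = 4711;
```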

Blocks and Files: If you are going to be possibly vectorizing data, then the vectors need storing someplace, and you could do that yourself or you could ring up Pinecone or Milvus or Weaviate. Is it possible for you to say which way you are thinking?

Justin Borgman: Your questions are spot on. I’m trying to think of what I should say here … I’ll say nothing for today. Other than that, that is a perfect question and I will have a very clear answer in about six weeks.

Blocks and Files: If I get talking to a prospect and the prospect customer says, yes, I do have data in disparate sources within individual datacenters and across datacenters and in the public cloud, and I have SaaS datasets – should I then say, go to a single lakehouse data warehouse supplier, for example Snowflake or Databricks? Or should I carry on using where my data currently is and just virtually collect it together as and when necessary with, for example, Starburst? What are the pros and cons of doing that?

Justin Borgman: Our answer is actually a combination of the two, and I’ll explain what I mean by that. So we think that storing data in object storage in a lake in open formats like Iceberg tables is a wonderful place to store large amounts of data. I would even say as much as you reasonably can because the economics are going to be ideal for you, especially if you choose an open format like Iceberg, because the industry has decided that Iceberg is now the universal format, and that gives you a lot of flexibility as a customer. So we think data lakes are great. However, we also don’t think it is practical for you to have everything in your lake no matter what. Right? It is just a fantasy that you’ll never actually achieve. And I say this partly from my own experience…

So we need to learn from our past mistakes. And so I think that the approach has to have both. I think a data lake should be a large center of gravity, maybe the largest individual center of gravity, but you’re always going to have these other data sources, and so your strategy needs to take that into account.

I think that the notion that you have to move everything into one place to be able to have an AI strategy is not one that’s going to work well for you because your data is always going to be stale. It’s never going to be quite up to date. You’re always going to have purpose-built database systems that are running your transactional processing and serving different purposes. So our approach is both. Does that make sense?
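Borgman’s advice to land as much data as possible in open Iceberg tables translates, in Trino’s Iceberg connector, into DDL along these lines (catalog and table names are illustrative; the `format` and `partitioning` properties are as documented for Trino):

```sql
-- Create an Iceberg table in object storage, stored as Parquet and
-- partitioned by month so large scans stay cheap and prunable.
CREATE TABLE iceberg.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    total       DECIMAL(12, 2),
    order_date  DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
);
```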

Blocks and Files: It makes perfect sense, Justin. You mentioned databases, structured data. Can Starburst support the use of structured data in block storage databases?

Justin Borgman: Yes, it can.

Blocks and Files: Do you have anything to do or any connection at all with knowledge graphs for representing such data?

Justin Borgman: We do have connectors to a couple of different graph databases, so that is an option, but I wouldn’t say it’s a core competency for us today.

Blocks and Files: Stepping sideways slightly. Backup data protection companies such as Cohesity and Rubrik will say, we have vast amounts of backed-up data in data stores, and we’re a perfect source for retrieval-augmented generation. And that seems to me to be OK, up to a point. If you met a prospect who said, well, we’ve got lots of information in our Cohesity backup store, we’re using that for our AI pipelines, what can you do there? Or do you think it is just another approach that’s got its validity, but it’s not good enough on its own?

Justin Borgman: From our customer base, I have not seen a use case that was leveraging Cohesity or Rubrik as a data source, but we do see tons of object storage. So we have a partnership in fact with Dell, where Dell is actually selling Starburst on top of their object storage, and we do work with Pure and MinIO and all of these different storage providers that have made their storage really S3 compatible. It looks like it’s S3, and those are common data sources. But the Cohesitys and Rubriks of the world, I haven’t seen that. So I’m not sure if the performance would be sufficient. It’s a fair question, I don’t know, but the fact that I haven’t seen it suggests there’s probably a reason, is my guess.

Blocks and Files: Let’s take Veeam for a moment. Veeam can send its backups to object storage, which in principle gives you access to that through an S3-type connector. But if Veeam sends its backups to its own storage, then that becomes invisible to you unless you and Veeam get together and build a connector to it. And I daresay Veeam at that point would say, nice to hear from you, but we are not interested.

Justin Borgman: Yes, I think that’s right.

Blocks and Files: Could I take it for granted that you would think that although a Cohesity/Rubrik-style approach to providing information for RAG would have validity, it’s not real-time and therefore that puts the customers at a potential disadvantage?

Justin Borgman: That’s my impression. Yes, that’s my impression.