How VAST Data could realize Thinking Machines’ vision

Comments by VAST Data co-founders CEO Renen Hallak and CMO Jeff Denworth suggest the company is going to develop data infrastructure software that can help fulfill the vision of long-defunct supercomputing outfit Thinking Machines.

This goes a long way beyond simply storing bits of data in the clever way that VAST’s Universal Storage already does.

Hallak’s thinking

Renen Hallak, VAST Data

In May last year Hallak told Protocol Enterprise that VAST wants to run its own data science platform, a trajectory he said would pit it against vendors including Databricks.

“We think that five years from now… that infrastructure stack needs to be very different. It needs to enable AI supercomputers rather than the applications that we have in the past… Vertical integration adds to simplicity. But more than that, it allows you to take full advantage of the underlying technology.”

Hallak implied that VAST Data would seek to build most of the platform itself: “It would not be possible for us to just buy someone else… and strap it on top of our system. We always lean towards doing the critical parts ourselves. And if there are any peripherals that aren’t as important… then maybe there would be room for acquisitions.”

“There is a massive opportunity to compile different data services into one product suite,” he added.

Denworth’s prognostications

Denworth told Computer Weekly in November last year: “In the next 20 years we’ll see a new class of application. It won’t be about transactional data, it won’t be about digital transformation. What we’ll see is computing coming to humans; seeing, hearing, and analyzing that natural data.

“We are conscious of not computing in one datacenter, and using unstructured and structured data. We are also conscious that data has gravity, but so also does compute when you get to the high end.”

This will mean a new computing framework with “very ambitious products” to be announced by VAST.

In January, Denworth told The Next Platform: “Thinking Machines was a very bespoke supercomputing company that endeavored to make some really interesting systems. And over time, that’s ultimately what we’re going to aim to make: a system that can ultimately think for itself.”

He added: “We realized that we could take that far beyond the classic definitions of a file system, but the realization was that the architecture that has the most intimate understanding of data can make the best decisions about what to do with that data. First by determining what’s inside of it. Second of all, by either moving the data to where the compute is or the compute to where the data is, depending upon what the most optimized decision is at any given time.”

The Beeler video podcast

Denworth told Brian Beeler on a Storage Review video podcast: “The next 20 years could [see] something that we call the natural transformation, [where] computers start to adapt to people… Our realization is if you rethink everything at the infrastructure level, there are gains that can be realized higher up the stack that we will take the world to over the next couple years.”

“Computers are definitely at a point where they can now do the sensory part of what humans could do before; they can see, they can hear, they can probably not smell so much, but understand natural information closer and closer to the way that humans understand it. And I think the leap again, from that to having thinking machines, may be a big one, maybe a smaller one. But once you get to a thinking machine, it’s game over, you don’t need anything beyond that. 

“And so I think it’s justified, that we’re putting all of our resources at building infrastructure that is enabling that next wave. And I think we will be surprised at how far we can take this in terms of what’s possible.”

He talked about organizations working in different parts of the stack: “We have, obviously, the hardware vendors working on GPUs; we have vendors like us working on that middle-of-the-sandwich infrastructure part and software; we have the application vendors working on life science, genomics, medical imaging; we have financial institutions taking advantage of all types of information coming into their systems. It’s really exciting.”

Data arrival will drive activity: “I think things are getting flipped on their head. Before, you had an application and it was reading data, either from memory or from storage, in order to manipulate it, and then it was writing the result that it understood to be the case. I think more and more we’re going to see data-driven applications; the data itself as it flows into the system will trigger functions that need to be run on it based on different characteristics of that information.

“And then you’ll have recursion of more and more functions that need to be run as a result of what we understand on this specific piece of information as we compare it to the rest of the data that we already have stored, specifically with respect to GPUs,” said Denworth.

“I think the fact that we’re called VAST data is a big clue. We are trying to build that next generation of data infrastructure.

“People will see us expand in the storage space and get closer and closer to realizing the true vision of universal storage: our customers not needing to think about where they placed their data, how much access they have to it, and what can be done with it.

“And in parallel, you’ll see more and more not-necessarily-storage parts coming from us as well, based on feedback that we get from customers.”

VAST will “essentially work to help customers solve the whole of their data processing, machine learning deep learning problem in a hybrid cloud world, in a way, where we just take not just the complexity of tiering and things like that as considerations and take them off the table… And this seems to [be] becoming more and more popular, as people start to understand some of these natural language processing models, some of these new computer vision, or computer audio models. And so that’s, that’s pretty exciting. We’ve got a lot that we’re doing with Nvidia.”

Thinking Machines and Databricks

Thinking Machines was a supercomputing company started in 1983 to build highly parallel systems using that era’s artificial intelligence technology. The aim was to chew through masses of data much more quickly than serial computing and so arrive at decisions in seconds or minutes instead of days or weeks.

The company overreached itself and crashed in 1994, with parts later bought by Sun Microsystems. Its architecture typically required a front-end server, with back-end SPARC CPUs and vector processors.

In February last year Blocks & Files wrote: “Databricks enables fast SQL querying and analysis of data lakes without having to first extract, transform and load data into data warehouses. The company claims its ‘Data Lakehouse’ technology delivers 9X better price performance than traditional data warehouses. Databricks supports AWS and Azure clouds and is typically seen as a competitor to Snowflake, which completed a huge IPO in September 2020… Databricks’ open source Delta Lake software is built atop Apache Spark.”
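To make that lakehouse description concrete, here is a minimal sketch of the pattern, assuming a Spark session with the Delta Lake extension installed; the table path and column names are invented for illustration and are not from Databricks or VAST:

```python
# Illustrative only: query files sitting in a data lake directly with SQL,
# with no extract/transform/load step into a separate warehouse.
# The path /lake/sensor-events and its columns are hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-query")
    # Standard Delta Lake configuration for a Spark session.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Query the Delta table in place, where it sits in the lake.
result = spark.sql("""
    SELECT device_id, avg(reading) AS avg_reading
    FROM delta.`/lake/sensor-events`
    GROUP BY device_id
""")
result.show()
```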

The VAST future

VAST Data is going to build a data infrastructure layer vertically integrated with its existing storage platform to form what today would be called an AI supercomputer. This layer will provide data lake capabilities and be able to initiate analytics processing itself; the data as it flows into the system will trigger functions that need to be run on it.
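VAST has not described an API for this data-triggered model, so any illustration is speculative. A minimal sketch, assuming a simple registry of handler functions keyed on characteristics of incoming data (every name below is hypothetical, none of it from VAST), might look like this:

```python
# Hypothetical sketch of a data-driven pipeline: functions are registered
# against characteristics of incoming data and run when matching data arrives,
# instead of an application pulling data and processing it on its own schedule.

from typing import Callable, Dict, List

# Registry mapping a data characteristic (here, a content type) to handlers.
_handlers: Dict[str, List[Callable[[dict], None]]] = {}

def on_arrival(characteristic: str):
    """Register a function to run when data with this characteristic arrives."""
    def register(fn: Callable[[dict], None]):
        _handlers.setdefault(characteristic, []).append(fn)
        return fn
    return register

def ingest(record: dict):
    """Called as data flows in; triggers any functions registered for it."""
    for fn in _handlers.get(record.get("type", ""), []):
        fn(record)  # e.g. classify, index, or move the data

@on_arrival("image")
def tag_image(record: dict):
    print(f"running vision model on {record['path']}")

@on_arrival("audio")
def transcribe(record: dict):
    print(f"running speech-to-text on {record['path']}")

# The arriving data drives the work:
ingest({"type": "image", "path": "/lake/cam01/frame-000123.jpg"})
```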

VAST CTO Sven Breuner previously confirmed this, saying VAST will link customers’ separate VAST systems together: “It’s now time to start moving up the stack by integrating more layers around database-like features and around seamlessly connecting geo-distributed datacenters.”

We think that VAST will use a lot of Apache open source software: Spark, as Databricks does; Druid, as Imply does; and Kafka, as Confluent does.
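If VAST did build on those projects, the plumbing is already familiar. A minimal sketch, assuming Spark Structured Streaming consuming a Kafka topic (the broker address and topic name are invented, and the spark-sql-kafka connector must be on the classpath), could look like this:

```python
# Illustrative only: a Spark Structured Streaming job that processes events
# from Kafka as they arrive, the standard open-source pattern this article
# speculates VAST could build on. Broker and topic names are made up.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("data-driven-example").getOrCreate()

# Read the topic as an unbounded stream; new records trigger processing.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "incoming-data")               # hypothetical topic
    .load()
)

# Kafka keys and values arrive as bytes; cast to strings before analysis.
decoded = events.select(col("key").cast("string"), col("value").cast("string"))

# Write results continuously; a real pipeline might write to a Delta table
# or trigger downstream models instead of the console.
query = decoded.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```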

VAST is looking at hearing, speech, and vision applications and will use Nvidia hardware, such as the Grace and Hopper chip systems. We are sure that penta-level cell flash and the CXL bus will play a part in VAST’s storage and infrastructure roadmap.

It will present its IT infrastructure systems, both on-premises and in the public cloud, in a bid to help customers solve all of their data processing and deep learning problems in a hybrid cloud world. We think VAST will not port its Universal Storage software to the public cloud. The Cnode software could be ported easily, but the Dnode structure (storage-class memory drive front end with NVMe QLC SSD backend drives) could be hard to replicate with the appropriate storage instances in the public cloud.

B&F thinks it’s likelier there will be a VAST system in a public cloud available to the CSP’s customers directly or indirectly.

Our understanding is that VAST will announce its 10-year roadmap direction at an event later this year.