Startup DataPelago has revealed its Universal Data Processing Engine (UDPE), software that accelerates compute for data analytics and GenAI models.
DataPelago – a portmanteau of “data” and “archipelago” – says it’s creating a new data processing standard for the accelerated computing era to overcome performance, cost, and scalability limitations of existing software architectures and x86 CPUs. The UDPE uses open source Gluten, Velox, and Substrait to turbocharge Spark and Trino, providing customers with “disruptive price/performance advantages.” It integrates with existing data stores, lakehouse platforms, and tools including SQL, Python, Airflow workflow automation, Tableau, and Power BI, with no need for data migration and no lock-in.
DataPelago was founded in 2021 by CEO Rajan Goyal and chief product officer Anand Iyer. Goyal was CTO at DPU startup Fungible, acquired by Microsoft for around $190 million in 2022. His backstory includes hardware/software co-design at Cavium. Think of the UDPE as a quasi-software DPU.
DataPelago has raised more than $75 million in VC funding: an $8 million seed round in 2021, a $20 million A-round in 2022, and a $47 million venture round this month. The latest round involved Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners, and Silicon Valley Bank, a division of First Citizens Bank.
Goyal stated: “Today, organizations are faced with an insurmountable barrier to unlocking breakthrough intelligence and innovation: processing an endless sea of data … By applying nonlinear thinking to overcome data processing’s current limits, we’ve built an engine capable of processing exponentially increasing volumes of complex data across varied formats.”
An investor recalled that Goyal planned to proceed by “first building a software-based query plan that leverages inherent data knowledge to compute adjacency and storage to compute proximity. On top of this, he described his vision of creating networked query engines that could be used to achieve massively parallel query execution.” These engines would support CPUs, GPUs, FPGAs, and other accelerating hardware.
UDPE refactors data processing to exploit accelerated computing, using higher degrees of parallelism and a tightly coupled memory model to deliver orders-of-magnitude higher performance.
It has three component layers:
- DataVM – a virtual machine with a domain-specific Instruction Set Architecture (ISA) for data operators providing a common abstraction for execution on CPU, GPU, FPGA, and custom silicon.
- DataOS – operating system layer mapping data operations to heterogeneous accelerated computing elements and managing them dynamically to optimize performance at scale.
- DataApp – pluggable layer that enables integration with platforms including Spark and Trino to deliver acceleration capabilities to these engines.
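The DataVM idea, a single operator instruction set that runs unchanged on different hardware, can be illustrated with a minimal Python sketch. Every name in it (Instr, CpuBackend, ChunkedBackend, run, and the three operator names) is hypothetical and chosen for illustration only; the article does not describe DataPelago's actual ISA or code, and the second backend merely imitates a parallel device on an ordinary CPU.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instr:
    """One instruction in a made-up data-operator ISA (hypothetical)."""
    op: str           # "filter_gt", "scale", or "sum"
    arg: float = 0.0  # operand; ignored by "sum"


class CpuBackend:
    """Stand-in execution target. Other targets expose the same methods."""
    name = "cpu"

    def filter_gt(self, col: List[float], arg: float) -> List[float]:
        return [x for x in col if x > arg]

    def scale(self, col: List[float], arg: float) -> List[float]:
        return [x * arg for x in col]

    def sum(self, col: List[float], arg: float) -> List[float]:
        return [sum(col)]


class ChunkedBackend(CpuBackend):
    """Second stand-in target: sums in chunks of two, then combines the
    partial results, loosely imitating a parallel reduction."""
    name = "chunked"

    def sum(self, col: List[float], arg: float) -> List[float]:
        partials = [sum(col[i:i + 2]) for i in range(0, len(col), 2)]
        return [sum(partials)]


def run(program: List[Instr], column: List[float], backend: CpuBackend) -> List[float]:
    """Execute the same instruction list on whichever backend is supplied."""
    for instr in program:
        column = getattr(backend, instr.op)(column, instr.arg)
    return column


if __name__ == "__main__":
    program = [Instr("filter_gt", 4.0), Instr("scale", 2.0), Instr("sum")]
    data = [1.0, 5.0, 10.0, 20.0]
    for backend in (CpuBackend(), ChunkedBackend()):
        print(backend.name, run(program, data, backend))
```

The point of the sketch is that the program is written once against the instruction set, while each backend is free to implement an operator differently as long as the result is the same.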
The company claims its UDPE is suited to resource-intensive use cases: analyzing billions of transactions while ensuring data freshness, supporting AI-driven models that detect threats at wire-line speeds across millions of consumer and datacenter endpoints, and providing a scalable platform for the rapid deployment of training, fine-tuning, and RAG inference pipelines.
UDPE is not a storage engine. A spokesperson told B&F: “A storage engine (like Speedb) is used to write data to and read data from storage drives and is written in low-level code. A storage engine cares about data placement on the storage rather than the semantics of query or data processing requests. DataPelago is a data processing engine for GenAI and analytics workloads. DataPelago sits higher up in the technology stack. It focuses on processing data processing queries/requests and leaves the placement of the actual data to the underlying storage layer, which would include technologies such as the Speedb engine.”
It “introduces enhancements that automatically map operations to the most suitable computing hardware – be it CPU, GPU, FPGA, or others – and dynamically reconfigures these elements to maximize performance for the target hardware … [It] requires no custom hardware and works off standard accelerated compute instances available in the cloud from hyperscalers such as AWS, Azure, and GCP as well as the new GPU cloud providers such as CoreWeave, Crusoe, Lambda, etc. All of this happens seamlessly for users, requiring no changes to queries, code, applications, workflows, tools, or processes.”
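As a rough illustration of what mapping operations to "the most suitable computing hardware" could look like, the Python sketch below places each operator in a query plan on the cheapest device that is actually present. It is an assumption-laden toy: the COSTS table, its numbers, the operator names, and the place function are all invented for this example and are not DataPelago measurements or APIs.

```python
from typing import Dict, List, Set, Tuple

# Invented relative costs per operator and device (lower is better).
# These figures are illustrative placeholders, not benchmarks.
COSTS: Dict[str, Dict[str, float]] = {
    "scan": {"cpu": 1.0, "gpu": 3.0},
    "hash_join": {"cpu": 8.0, "gpu": 2.0, "fpga": 4.0},
    "regex_filter": {"cpu": 6.0, "fpga": 1.5},
}


def place(plan: List[str], available: Set[str]) -> List[Tuple[str, str]]:
    """Pick, for each operator, the lowest-cost device among those available."""
    placement = []
    for op in plan:
        candidates = {dev: cost for dev, cost in COSTS[op].items() if dev in available}
        if not candidates:
            raise ValueError(f"no available device can run {op!r}")
        best = min(candidates, key=candidates.get)
        placement.append((op, best))
    return placement


if __name__ == "__main__":
    plan = ["scan", "hash_join", "regex_filter"]
    # On an instance with only CPU and GPU, regex_filter falls back to the CPU.
    print(place(plan, {"cpu", "gpu"}))
    # With an FPGA present, the same plan is placed differently.
    print(place(plan, {"cpu", "gpu", "fpga"}))
```

The query plan itself never changes between the two calls; only the placement does, which is the behavior the quote describes when it says users need make no changes to queries or code.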
Goyal has written separately about his experience founding DataPelago.