Onehouse launches vector embeddings generator

Data lake startup Onehouse is launching a vector embeddings generator to automate embeddings pipelines as part of its managed ELT cloud service.

Onehouse provides a fully managed data lakehouse designed to be universal – ingesting data at terabyte scale in minutes from any source and supporting all query engines and open table formats such as Iceberg and Hudi. Its products include LakeView, a free data lakehouse observability tool for the OSS community, and Table Optimizer, which automates data lakehouse table optimizations.

Vector embeddings are numerical representations of multiple aspects of text, audio, image, and video data. They can be searched to find groups of similar vectors – enabling, for example, the location of images of a particular product or the retrieval of relevant text for AI applications built on large language models (LLMs).
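To make the idea of searching for similar vectors concrete, here is a minimal sketch in Python. The file names and three-dimensional vectors are invented for illustration (real embeddings typically have hundreds or thousands of dimensions), and cosine similarity is just one common way to compare vectors, not necessarily what any particular product uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalog, top_k=2):
    """Return the top_k catalog keys ranked by similarity to the query vector."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query, catalog[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical hand-made embeddings, not output from a real model.
catalog = {
    "red_sneaker.jpg": [0.9, 0.1, 0.0],
    "blue_sneaker.jpg": [0.8, 0.2, 0.1],
    "garden_chair.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend this is the embedding of a sneaker photo

print(most_similar(query, catalog))
```

The two sneaker images rank ahead of the chair because their vectors point in nearly the same direction as the query.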

Onehouse's ELT service automates embeddings pipelines – continuously delivering data from streams, databases, and files on cloud storage to foundation models from OpenAI, Voyage AI, and others. The models then return the embeddings to Onehouse, which stores them in highly optimized tables on the user’s data lakehouse.

Vinoth Chandar, Onehouse

Vinoth Chandar, founder and CEO of Onehouse, stated: “AI is going to be only as good as the data fed to it, so managing data for AI is going to be a key aspect of data platforms going forward.”

Chandar is PMC chair of Apache Hudi, an open source framework for building transactional data lakes, with processes for ingesting, managing, and querying large volumes of data. He led the creation of Hudi while at Uber in 2016.

He said: “Hudi’s powerful incremental processing capabilities also extend to the creation and management of vector embeddings across massive volumes of data. It provides both the open source community and Onehouse customers with significant competitive advantages, such as continuously updating vectors with changing data while reducing the costs of embedding generation and vector database loading.”
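The cost argument in that quote rests on only regenerating embeddings for data that has changed. The Python sketch below illustrates the principle with a content hash per record. This is a swapped-in technique for illustration, not how Hudi's incremental processing or the Onehouse service is implemented, and `fake_embed` is a hypothetical stand-in for a call to an embedding model.

```python
import hashlib

def fake_embed(text):
    """Hypothetical stand-in for an embedding model call; NOT a real model API."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]

def incremental_embed(records, store, embed=fake_embed):
    """Embed only the records whose text is new or changed since the last run.

    records: dict of record_id -> text (current state of the source)
    store:   dict of record_id -> (text_hash, vector), updated in place
    Returns the ids that were (re)embedded on this run.
    """
    updated = []
    for record_id, text in records.items():
        text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = store.get(record_id)
        if cached is None or cached[0] != text_hash:
            store[record_id] = (text_hash, embed(text))
            updated.append(record_id)
    return updated

store = {}
first_run = incremental_embed({"a": "red shoe", "b": "blue shoe"}, store)
second_run = incremental_embed(
    {"a": "red shoe", "b": "navy shoe", "c": "garden chair"}, store
)

print(first_run)   # both records are new
print(second_run)  # only the changed and the added record
```

On the second run, record "a" is skipped because its text is unchanged, so the model is called twice rather than three times.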

Onehouse believes the data lakehouse – with its open data formats on top of scalable, inexpensive cloud storage – is becoming the natural platform of choice for centralizing and managing the vast amounts of data used by AI models. Users are able to choose what data and embeddings need to be moved to downstream vector databases.

Onehouse graphic

By adding a vector embeddings generator, Onehouse claims its customers can streamline their vector embeddings pipelines to store embeddings directly on the lakehouse. This provides all of the lakehouse’s capabilities around update management, late-arriving data, concurrency control and more, while scaling to the data volumes needed to power large-scale AI applications.

Onehouse integrates with vector databases, such as Pinecone and Zilliz, to enable high-scale, low-latency serving of vectors for real-time use cases. The data lakehouse stores all of an organization’s vector embeddings and serves vectors in batch, while hot vectors are moved dynamically to the vector database for real-time serving. This architecture provides scale, cost, and performance advantages for building AI applications such as LLMs and intelligent search.
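The split between a lakehouse holding every vector and a vector database holding only the hot ones can be sketched as a two-tier store. Everything in this Python example is hypothetical: the class name, the use of plain dictionaries for both tiers, and the access-count rule for deciding what is hot are assumptions, since the announcement does not say how hot vectors are selected.

```python
from collections import Counter

class TieredVectorStore:
    """Toy two-tier store: a complete 'lakehouse' plus a small 'hot' serving tier."""

    def __init__(self, hot_capacity):
        self.lakehouse = {}   # every vector lives here
        self.hot = {}         # subset mirrored for low-latency serving
        self.hits = Counter() # access counts used to pick hot vectors
        self.hot_capacity = hot_capacity

    def put(self, key, vector):
        self.lakehouse[key] = vector
        if key in self.hot:
            self.hot[key] = vector  # keep the hot copy in sync on updates

    def get(self, key):
        vector = self.lakehouse[key]  # raises KeyError for unknown keys
        self.hits[key] += 1
        tier = "hot" if key in self.hot else "lakehouse"
        return vector, tier

    def rebalance(self):
        """Mirror the most frequently read vectors into the hot tier."""
        top = [key for key, _ in self.hits.most_common(self.hot_capacity)]
        self.hot = {key: self.lakehouse[key] for key in top}

tiered = TieredVectorStore(hot_capacity=1)
tiered.put("a", [1.0, 0.0])
tiered.put("b", [0.0, 1.0])
for _ in range(3):
    tiered.get("a")
tiered.get("b")
tiered.rebalance()

print(tiered.get("a")[1], tiered.get("b")[1])
```

After rebalancing, the frequently read vector "a" is served from the hot tier while "b" is still read from the full store.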

Onehouse screenshot

Kaushik Muniandi, engineering manager at consumer market research business NielsenIQ, was quoted in the Onehouse announcement: “Text search has evolved dramatically. The traditional tools have complications on their own, as in, ingress of data and egress when we would want to move out. Vector embeddings on data lakehouse not only avoids the ingress and egress complexities and cost but also can scale to massive volumes. We found that vector embeddings on data lakehouse is the only solution that scales to support our application’s data volumes while minimizing costs and delivering responses in seconds.”

Readers interested in seeing vector embeddings for AI built and managed in the data lakehouse can join an upcoming webinar with NielsenIQ and Onehouse: Vector Embeddings in the Lakehouse: Bridging AI and Data Lake Technologies. It takes place on August 27 at 10am Pacific Time (17:00 UTC).

Bootnote

Onehouse was founded in 2021 and has raised $68 million in funding, with a June B-round contributing $35 million, a year after a $25 million A-round. It uses the term ELT (Extract, Load, and Transform) in its communications material rather than the more common ETL (Extract, Transform, and Load).