Startup provides AI data as a service from its own abstracted metadata

Startup View has built what it says is a fast-to-deploy AI stack based on semantic cells identified in multi-type data sources, a Universal Data Representation (UDR), and its own object storage, offering quick ingest, search, and analysis through large language model (LLM) technology.

It says that customers can use its AI data as a service to seamlessly link all their data sources, of virtually any origin, format, or file type, as input to the AI with no migration needed. They can “find data assets and uncover hidden insights with hybrid search” and “effortlessly answer questions” about their proprietary data while keeping it secure, claims View.

This set of capabilities appears to represent similar end-user capabilities to those of Dell, DDN, Pure, NetApp, WEKA, VAST Data, and other NVIDIA partners involved in generative AI LLM querying and analysis of proprietary datasets, but with no need to alter existing storage arrangements. We’ll describe View’s startup story and base technology here, and look at what it means, and what users can do with it, in a second article.

Joel Christner

Beginnings

View was started up in April 2024 by California-based CEO Joel Christner and Chief Product and Revenue Officer Keith Barto, and has raised seed funding. Christner was a distinguished engineer and director in Dell’s Office of the CTO from March 2019. Before that he was a senior exec at four startups, all acquired, notably chief scientist at cloud storage developer StorSimple, acquired by Microsoft in 2012. Prior to that he had two stints at Cisco between 2002 and 2009.

Christner has been involved at partner level in a couple of venture funds from November 2021 onwards, perhaps aided by his four startup acquisition exits. He has also part-time chaired a couple of SNIA DNA storage workgroups and been a contributor to the .NET Foundation.

Keith Barto

Colorado-based Keith Barto has led global sales and sales partner teams at Cisco and NetApp, and was a senior consultant at Xiotech back in the 2003 to 2005 period. He founded Immersive Partner Solutions in 2013 and was its president and CEO until its acquisition by NetApp in 2017.

We were briefed by Christner and Barto and taken through the technology. It starts with data ingest into its own version of an S3 object storage vault, and the creation of unique metadata about the ingested data, which is part of an AI-focused data management architecture.

Data management for AI

Christner told me: “Prior to starting View, I was leading research at Dell in the office of the CTO, and I was focused on things that were five to seven years out. So we’ve been working on large language models, generative AI, sentence transformers, DNA, data storage, silicon photonics, all kinds of things that were, you know, way out in the future. … One of the areas that I focused on was data management.”

He said he realized that “the challenge with data management is that the farther up the data stack you go, the more bespoke things become.”

But “the way that you look at data has to shift into a lens that is really finely tuned for your organization, which makes it hard to build horizontal data management.” View has developed a horizontal data management scheme with a lens enabling LLMs to look at a customer’s own data without having to build complex pipelines to vectorize existing corporate data vaults with their petabytes of data.

Christner added: “The market is so highly fragmented because of that, in that if you plant your flagpole down on one tool, let’s say you pick Alation as your data catalog, or Apache Atlas, or you pick Tableau as your visualization and connector platform, you start to limit the surface area [of] the other tools that you can use. So to build an end-to-end pipeline that goes from source data over to a curated pool of metadata that you could use for discovery, analytics, machine learning, provenance, and governance is impossible [or] it’s very difficult.”

According to Christner, this means that “Enterprises are in a bad spot because they don’t have a solid data management strategy. They don’t know where their data resides, what it contains, who owns it, how it’s being used, what its features are. So AI is making the data management problem multiplicatively compounded.”

Built-in RAG

Christner says View’s software is a response to this problem: “View is an end-to-end platform that accelerates a customer’s AI journey by connecting their private source enterprise data to an AI experience internally behind their firewall… It has an entire … industry-leading RAG pipeline baked into the platform.”
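
View hasn’t published its pipeline internals, but for context, a retrieval-augmented generation (RAG) flow generally embeds a query, retrieves the most similar source chunks, and feeds them to an LLM as grounding context. Here is a minimal, generic sketch of that shape — illustrative only, with a toy bag-of-words stand-in for the embedding model:

```python
# Minimal, generic RAG sketch -- illustrative only, not View's implementation.
# embed() is a toy stand-in for any embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # The prompt an LLM would receive, grounded in the retrieved chunks.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Botox is used to treat chronic migraine in adults.",
    "The quarterly sales report covers EMEA revenue.",
    "Botox dosing guidance appears in section 4 of the label.",
]
print(build_prompt("What is Botox used for?",
                   retrieve("What is Botox used for?", chunks)))
```

What the rest of this article covers is what View says it layers on top of this generic shape.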

At this point I’m wondering: if View takes in the same documents, PDFs, presentations, files, objects, etc., as other RAG-assisted LLMs, then what does View do that makes a difference?

Christner again: “There’s a number of things that we do, and we’ve got close to a dozen patents filed just on our core IP. … We have three primary ways that we can ingest data. … We have our own object storage platform that we’ve built from the ground up.” That can be loaded using the AWS CLI, Cyberduck, or some other S3-compatible tool.

View screenshot showing CIFS Server crawl plan admin window. [iPhone photo off Zoom session.]

A second method is to use View’s REST API “to interact with every component of the system, including data ingestion. And the third is through our data crawlers. So if you want to point us at repositories, like files, object, etc., you can point us at those repositories, and we’ll periodically crawl for new data.” 
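
For the first route, any S3-compatible client should work. A hedged sketch with boto3 follows — the endpoint URL, bucket name, and credentials are hypothetical placeholders, not documented View values:

```python
# Illustrative upload to an S3-compatible object store using boto3.
# The endpoint URL, bucket, and credentials below are hypothetical
# placeholders, not documented View values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://view.example.internal:9000",  # hypothetical endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Push a local document into a bucket, after which ingest processing begins.
s3.upload_file("quarterly-report.pdf", "ingest-bucket", "quarterly-report.pdf")
```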

Semantic cells

Now a processing pipeline kicks off. Christner explained: “The processing pipeline involves about seven or eight steps. The first thing we do is we determine the type of the content. Is it a PowerPoint document? Is it a JSON object? Is it a data table from a SQL database query? Is it a PDF file? Parquet? We’ve got a long list of content types that we support. The second is we open the source content and consume it, and we identify what are known as semantic cells.”
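
As a mental model for that first step, content typing can be as simple as dispatching on a file’s extension or signature to the right parser; the mapping below is illustrative, not View’s actual list:

```python
# Minimal content-type detection by extension -- a stand-in for the richer
# type identification View describes; the parser mapping is illustrative.
from pathlib import Path

PARSERS = {
    ".pdf": "pdf_parser",
    ".pptx": "powerpoint_parser",
    ".json": "json_parser",
    ".parquet": "parquet_parser",
    ".csv": "table_parser",
}

def detect_type(path: str) -> str:
    # Route each asset to a type-specific parser; unknown types fall through.
    suffix = Path(path).suffix.lower()
    return PARSERS.get(suffix, "unknown")

print(detect_type("slides.pptx"))  # powerpoint_parser
```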

What is a semantic cell? Imagine a data asset such as a PDF document. Christner said a semantic cell is “a region within a data asset that has correlated meaning. So for instance, this header up at the top of the PDF might be a semantic cell. This text header might be a semantic cell. This image might be a semantic cell. This paragraph here might be a semantic cell, and this paragraph might be a semantic cell. So it’s just regions within a data asset that are likely correlated together in some way.”

Screenshot of View response to a request for Botox information from a proprietary document source set. [iPhone photo from Zoom briefing session.]

I’m thinking this is like a super-token describing an abstracted part of the PDF document; identifiable areas within a data file. The file could be an image, audio or text, etc., and the semantic cells describe components of that image, text or audio.
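
One way to picture a semantic cell is as a typed, positioned region record attached to its parent asset. The field names in this sketch are guesses for illustration, not View’s schema:

```python
# A guess at what a semantic-cell record might carry; field names are
# illustrative, not View's actual schema.
from dataclasses import dataclass

@dataclass
class SemanticCell:
    asset_id: str    # which source document the cell belongs to
    cell_type: str   # "header", "paragraph", "image", "table", ...
    position: tuple  # e.g. (page, bounding box) within the asset
    content: str     # extracted text, or a reference for binary content

cells = [
    SemanticCell("doc-42", "header", (1, (0, 0, 600, 40)),
                 "Botox Prescribing Info"),
    SemanticCell("doc-42", "paragraph", (1, (0, 60, 600, 200)),
                 "Indications and usage ..."),
]
print(len(cells), "cells extracted from", cells[0].asset_id)
```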

Universal Data Representation

The next step is, Christner said, to “find the regions of the document that have a high likelihood of containing correlated information. The third step: we generate what we call a UDR, or universal data representation document. UDR is one of the first patents that we filed, and it is a homogeneous representation of heterogeneous data from heterogeneous data sources.”

“To boil that down, we extract features from the document since we know its type, including things like the key terms. We infer the schema. We create a flattened representation of the data. We build an inverted index, and [altogether] we do about eight or 10 different things.”

Does the UDR contain the semantic cell data? “UDR absolutely contains the semantic cells. But it also contains a separate list of the key terms that we’ve identified in the document, what the schema for the object is, and what the position of every word is relative to the positions of other words. Okay; imagine building a Google index over an individual document. That’s effectively what we’re doing with UDR.”
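
Christner’s “Google index over an individual document” framing maps naturally onto a positional inverted index: for each term, the word positions at which it occurs. A minimal sketch of that idea — not View’s actual UDR format:

```python
# A positional inverted index over one document: for each term, the word
# positions at which it occurs. Illustrative of the "Google index over an
# individual document" idea, not View's actual UDR format.
from collections import defaultdict

def positional_index(text: str) -> dict[str, list[int]]:
    index: dict[str, list[int]] = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return dict(index)

idx = positional_index("Botox is indicated for chronic migraine. "
                       "Botox dosing varies.")
print(idx["botox"])  # [0, 6] -- both positions of the term
```

Because every content type reduces to the same term-and-position form, a single query path can serve PDFs, JSON, and presentations alike — which is the point Christner makes next.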

What this means is that “We have a generic representation of that object, and that representation is consistent, whether it is a PDF file, a JSON file, or a PowerPoint presentation. And the main reason we do that is because, once you boil that down to that common representation, we now have a single form that we can query across content types.”

There’s more: “Beyond semantic cells and UDR, we create graph representations. We store this metadata in a data catalog that we built because we have the ability to do some search types of functions that the market just does not have yet.”

Then: “We had to build our own search and data catalog platform, and, of course, we generate embeddings, and you can use whatever model you want.”
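
That model-agnostic embedding step could, for instance, be satisfied by the open source sentence-transformers library. The model choice here is arbitrary and not a statement of what View ships:

```python
# One way to generate embeddings with an interchangeable model, using the
# open source sentence-transformers library; the model choice is arbitrary
# and not a statement of what View ships.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any compatible model works
vectors = model.encode([
    "Botox is indicated for chronic migraine.",
    "Quarterly EMEA revenue grew 8 percent.",
])
print(vectors.shape)  # (2, 384) for this particular model
```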

The graph representations are not knowledge graphs. Christner explains: “When I think of knowledge graphs, I think of ontologies, and we don’t really have an ontology, per se. We’re using graph to map relationships amongst UDR documents to source documents, to repositories, to owners, to semantic cells.”
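
That kind of non-ontology relationship graph — plain edges linking repositories, source documents, UDR documents, semantic cells, and owners — could be modelled with a general-purpose graph library such as networkx. Node and edge labels here are illustrative:

```python
# A non-ontology relationship graph linking repositories, source documents,
# UDR documents, semantic cells, and owners. Labels are illustrative.
import networkx as nx

g = nx.DiGraph()
g.add_edge("repo:finance-share", "doc:quarterly-report.pdf", rel="contains")
g.add_edge("doc:quarterly-report.pdf", "udr:doc-42", rel="represented_by")
g.add_edge("udr:doc-42", "cell:doc-42-p1-header", rel="has_cell")
g.add_edge("doc:quarterly-report.pdf", "owner:finance-team", rel="owned_by")

# Walk from a repository out to everything it ultimately holds or touches.
print(list(nx.descendants(g, "repo:finance-share")))
```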

View now has homogeneous abstracted metadata from heterogeneous data types and formats, which can be used to create subsets of the data, to search it, and to analyze it.

****

We will look at how View’s system uses this data in a second article.