Data management business Diskover was profiled in part one of our look at the company. In this second part we examine how its technology works and what sets it apart.
The core component is a scale-out, schemaless database, located on-premises or in the cloud, with Elasticsearch or OpenSearch used for indexing and crawling. Agents called Ingestors bring metadata into this database from various data sources, and additional agents or plugins send data upstream via pipelines for use in data warehouses and lakehouses, visualization tools, and management utilities. Diskover is extensible, and new plugins, both ingesting scanners and data selection and action agents, have been and are being developed.
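As a rough illustration of that ingest pattern, here is a minimal sketch using the Elasticsearch Python client. The index name and document fields are illustrative assumptions, not Diskover's actual schema:

```python
# Illustrative sketch only: the index name and field names are assumptions,
# not Diskover's actual schema.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One JSON blob of intrinsic file metadata, as an ingesting agent might emit it
doc = {
    "path": "/mnt/projects/render/frame_0001.exr",
    "size_bytes": 12_582_912,
    "mtime": "2024-11-02T09:14:33Z",
    "owner": "render-farm",
    "tags": {"project": "spring-campaign", "tier": "hot"},  # business context
    "indexed_at": datetime.now(timezone.utc).isoformat(),
}

# Index the document; downstream plugins can query this index for search,
# analytics, and rule-driven actions
es.index(index="file-metadata", document=doc)
```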

Data sources can be NFS mounts, CIFS/SMB shares, local mounts, public clouds such as AWS and Azure, and applications and SaaS apps such as Autodesk, Jira, Salesforce, and SharePoint. On-prem storage supplier sources include Dell (PowerScale/Isilon), Hammerspace, NetApp, Pure Storage, and VAST Data. New ingest/index plugins can be and have been developed.
The data sources, which can be scaled out to any number of repositories and locations, are scanned continuously and in parallel, say every 15 minutes, to obtain intrinsic metadata, which is housed in a single location to provide a global view. Extra metadata harvesting adds business context, such as project details.
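A hedged sketch of what such a parallel scan can look like follows; the mount paths and worker model are assumptions for illustration, not Diskover's actual scanner:

```python
# Illustrative only: walk each repository in parallel, collect intrinsic
# metadata via stat(), and return one record per file. Paths and the
# threading model are assumptions, not Diskover's implementation.
import os
from concurrent.futures import ThreadPoolExecutor

ROOTS = ["/mnt/nfs/projects", "/mnt/smb/finance"]  # example mounts

def scan_tree(root):
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            records.append({
                "path": full,
                "size_bytes": st.st_size,
                "mtime": st.st_mtime,
                "uid": st.st_uid,
            })
    return records

# Scan each repository in parallel; a scheduler would rerun this,
# for example every 15 minutes, to keep the index current.
with ThreadPoolExecutor(max_workers=len(ROOTS)) as pool:
    all_records = [rec for batch in pool.map(scan_tree, ROOTS) for rec in batch]
```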

The indexed metadata can be used for (1) searches to find data, (2) analytics to detect data characteristics, and (3) actions to move data and deliver it to upstream locations and applications. These actions can be automated, driven by rules and metadata content. Thus data can be selected by filter, cleaned, renamed, curated, moved across storage tiers, have sensitive data masked, be delivered to upstream processes, archived, and deleted.
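A rough sketch of how a rules-driven action might run against such a metadata index; the rule, index name, field names, and the archive step are assumptions for illustration, not Diskover's API:

```python
# Illustrative rule: select files not modified in two years and hand them
# to an action step. Index and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

rule = {"range": {"mtime": {"lt": "now-2y"}}}  # metadata-based selection filter

resp = es.search(index="file-metadata", query=rule, size=1000)
for hit in resp["hits"]["hits"]:
    path = hit["_source"]["path"]
    # A real action plugin might move the file to an archive tier,
    # mask sensitive content, or deliver it to an upstream pipeline.
    print(f"would archive: {path}")
```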
The analytics can be used by knowledge workers and business users, not just storage admin staff, and they access Diskover through a web UI.

Data destinations include CortexAI, Databricks, Dell’s data lakehouse, Power BI and Snowflake. Diskover says it’s a good idea to curate data before sending it to a data warehouse.
CEO Will Hall told us: “I don’t want everything in the data warehouse. I don’t want stuff that you have in your recycle bin. If I just dump everything I found on your storage, I’m going to get trash into my data warehouse. So I don’t want that. … By the way, that happens all the time. People are taking their temp files and their scratch base along with all the valuable data and dumping that into a [data warehouse or data lake] where it’s second only to [GPUs] as the most expensive compute you can possibly waste.”
Supported visualization apps include Alteryx, Grafana, Kibana, and Tableau. They can consume an Elasticsearch index as a data source. There is no need, as in the past, to find your data, export it to a spreadsheet, massage it there, and then build a graph from it. Hall again: “We found that the big customers are piping that into a visualization tool and then they were piping it into a data warehouse. So the data warehouses can also use an Elasticsearch index as a data source.”
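To give a sense of what a dashboard can pull straight from the index without the spreadsheet detour, here is a hedged example aggregation; the index and field names are assumptions:

```python
# Illustrative aggregation a Grafana or Kibana panel could equally run:
# total bytes per file extension. Index and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="file-metadata",
    size=0,  # only the aggregation is wanted, not the documents
    aggs={
        "by_extension": {
            "terms": {"field": "extension.keyword", "size": 20},
            "aggs": {"total_bytes": {"sum": {"field": "size_bytes"}}},
        }
    },
)
for bucket in resp["aggregations"]["by_extension"]["buckets"]:
    print(bucket["key"], bucket["total_bytes"]["value"])
```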
Diskover says its software is lightweight and has a small footprint, with low CPU and memory requirements. It says this is demonstrated by a performance impact document measuring the effect on production application performance while the Diskover indexing process is actively indexing file system storage. Based on the test results, the Diskover indexer caused no degradation to playback in the Resolve application on either NVMe-based or nearline SAS-based storage.
When data is sent upstream to data lakes, Diskover uses Parquet to multi-stream the data. CPO Paul Honrud tells us: “Take data from Elasticsearch into Snowflake. When you’re dealing with 80 billion files, it introduces delay, multi-hopping my data, and it’s error-prone because that’s a lot of data to be dealing with. … We found that if we could index the file system … we were creating a JSON blob and pushing the JSON blob into Elasticsearch. Once we had married some metadata to it, we could take and drop that into a Parquet-formatted S3 object that is consumable right by these things. And we could curate it. We could down-select the data going into the data warehouse.”
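A minimal sketch of that Elasticsearch-to-Parquet hand-off, using pyarrow and boto3; the index, query, bucket, and key names here are assumptions, not Diskover's pipeline:

```python
# Illustrative only: pull down-selected metadata out of Elasticsearch and
# land it as a Parquet object in S3, where a warehouse such as Snowflake
# can read it. Index, query, bucket, and key names are assumptions.
import io

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Curate: only pull records matching a business filter, for example "hot" tier
resp = es.search(index="file-metadata",
                 query={"term": {"tags.tier": "hot"}}, size=10_000)
rows = [hit["_source"] for hit in resp["hits"]["hits"]]

# Convert the JSON blobs to a columnar Parquet buffer
table = pa.Table.from_pylist(rows)
buf = io.BytesIO()
pq.write_table(table, buf)

# Drop the Parquet object into S3 for the warehouse to consume
boto3.client("s3").put_object(
    Bucket="curated-metadata",
    Key="file-metadata/batch-0001.parquet",
    Body=buf.getvalue(),
)
```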
Pipelines used to be specific to vertical application areas such as entertainment and media, life sciences, and oil and gas. The surge in GenAI interest has made AI pipelining horizontal.
Diskover says its customers are getting bigger, “well into six figures,” and growing fast requires cash. Hall told us: “We’ve never had VC funding. We’re doing a round right now and we’ve got some strategics involved in that.”

Honrud dislikes the idea of early-stage VC money: “if you take a bunch of VC money early, you spend it like a kid in an inheritance and the landscape is littered with these. They don’t know what to build. They burn through some money, then they realise they have to pivot. Komprise is right down this path. How many times did they pivot it? How much money have they taken to try to figure out where the market is? You have people like Igneous that didn’t survive. And so the VC funding early on I think distracts you from absolutely solving the problem for the customer.”
We were curious about how Diskover relates to Hammerspace, which it regards as a metadata source and not a competitor. Hall told us: “I think philosophically we align really well with Hammerspace. …It’s like if you have this global namespace, you still need to know where the relevant data is underneath that global namespace. And that’s where we are a nice kind of fit under something like that.”
In his view: “If you look at S3 as a protocol, I can store key-value pairs native with the object. So the ability to store custom metadata with an object has been around with S3 for 10, 15 years [and] your general person is dropping data into S3 with no metadata.” Adding metadata requires human tagging, and that doesn’t scale. There’s no automated “metadata for the masses” capability.
Honrud went on to opine that VAST and Hammerspace have moved that [KV pairs] over to the file system, adding: “Well, I’ve had that with S3 and not a lot of people are taking advantage of it.”
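As a hedged example of the S3 capability Hall describes, custom key-value metadata can be attached to an object and read back with any S3 client; the bucket, key, and metadata values below are assumptions:

```python
# Illustrative only: storing and reading custom key-value metadata on an
# S3 object with boto3. Bucket, key, and metadata values are assumptions.
import boto3

s3 = boto3.client("s3")

# Write an object with user-defined metadata attached natively
s3.put_object(
    Bucket="example-bucket",
    Key="projects/spring/report.pdf",
    Body=b"...file contents...",
    Metadata={"project": "spring-campaign", "owner": "finance", "retention": "7y"},
)

# Read the metadata back without downloading the object body
head = s3.head_object(Bucket="example-bucket", Key="projects/spring/report.pdf")
print(head["Metadata"])  # {'project': 'spring-campaign', ...}
```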
Diskover can produce high-quality metadata in an automated way, and that capability can be used by Dell, Hammerspace, NetApp, VAST Data, and other suppliers’ customers who need that metadata in order to automate data handling pipelines.
Hall said: “We sit in this kind of Switzerland world where we really aspire to be this kind of unstructured data platform that knows the most about your metadata and ultimately where you want to drop it off, meaning AI or data warehouses or what have you. … We’ve got a pretty cool little offering and we just haven’t told the world historically. Now we’re starting to try to tell the world.”