Komprise explains the mess: files, objects, silo sprawl and abstraction layers

As if in a dream, I thought: suppose we didn’t have file and object storage? Would we invent them? Now that we do have them, should they be united somehow? In a super silo or via a software abstraction layer wrapper? Where should functions like protection and life cycle management be placed?

Who could I ask for answers? Komprise seemed like a good bet. So I sent them a set of questions, and COO and president Krishna Subramanian kindly provided her answers.

Krishna Subramanian

Blocks & Files: If we were inventing unstructured data storage now, would we have both file and object formats or just a single format? And why?

Krishna Subramanian: The issue is one of cost versus richness. File hierarchies have rich context and are performant, but more expensive because of all the overhead. Object storage is more cost-efficient with a flat design, but slower. The need for different price/performance options is only increasing as data continues to grow, so both formats will be needed. However, objects are growing much faster than file formats because of their usefulness for cost-effective, long-term storage.

There is a use case for both file and object access, and it reflects the life cycle of a file/object. Typically, when a file is first created it is accessed and updated frequently: the traditional file workload. After that initial period of creation and collaboration the file still has value, but updates are unlikely and the workload pattern shifts to that of object. The organic workflow we see is that unstructured data is created by file-based apps, whereas for long-term retention and secondary use cases such as analytics, object is the ideal format.

Object’s abilities to handle massive scale, provide metadata search, deliver durability, and achieve lower costs are advantageous. An additional benefit of object storage is offloading data from the file system, allowing it to perform better by removing colder data. File storage systems start to bog down as space utilization approaches 90 per cent.

Are files and object storage two separate silo classes? Why?

Today, file and object storage are separate silo classes, because not only do they present and store data in different formats, but also the underlying architectures are different. For instance, object storage has multiple copies built-in whereas file storage does not, which impacts how you protect files in each environment. The way apps use file versus object is also different. Object data sets are often so large you search and access objects via metadata. With file you provide semi-structure with directories.

Other differences include:

  • File is more administratively intensive in regard to space management and data protection.
  • With objects, these tasks are largely handled by the cloud provider. Space is only constrained by your budget, and data protection is based on multiple copies, erasure coding and multi-region replication versus daily backup or snapshots with file.

What are the pros and cons of storing unstructured data in multiple silos?

The value of data and its performance requirements vary throughout its lifecycle. When data is hot, it needs high performance; when data is cold, it can be on a less performant but more cost-efficient medium. Silos have existed throughout the life of storage because different storage architectures deliver different price/performance.

However, there are still logical reasons for silos, as they allow IT to manage data distinctly based on unique requirements for performance, cost and/or security. The disadvantages of silos are lack of visibility, potentially poor utilization, and future lock-in. Data management solutions that employ real-time analytics will play an ever-stronger role in giving IT organizations the flexibility and agility to maintain silos without incurring waste or unnecessary risk.

Is it better to have a single silo, a universal store or multiple silos presented as one via an abstraction layer virtually combining silos?

Silos are here to stay — just look at the cloud. AWS alone has over 16 classes of file and object storage, not including third-party options. Silos are valuable for delivering specific price/performance, data locality and security features for different workloads, but they are a pain to manage. The future as we see it is this: the winning model will be silos with direct access to data on each silo and an abstraction layer to get visibility and manage data across silos.

Less ideal is an abstraction layer such as a global namespace or file system that sits in front of all the silos, because you have now created yet another silo and it has performance and lock-in implications. This is why you do not want a global namespace.

Rather, you want a global file index based on open standards to provide visibility and search across silos without fronting every data access. Data management that lives outside the hot data path, at the file/object level, gives you the best of both worlds: the best price/performance for data, direct access to data without lock-in, and unified management and visibility.

Should this actual or virtual universal store cover both the on-premises and public cloud environments (silos)?

A virtual universal data management solution should cover on-premises, public cloud and edge silos. Data at the edge is still nascent, but we will see an explosion of this trend. 

Modern data management technologies will be instrumental in bridging across edge, datacenters and clouds. Customers want data mobility through a storage-agnostic and cross-cloud data management plane that does not front all data access and limit flexibility.

Should the abstraction layer cover both files and objects, and how might this be done?

Yes. The management layer should abstract and provide duality of files and objects to give customers the ultimate flexibility in how they access and use data. Komprise does this, for example, by preserving the rich file metadata when converting a file to an object but keeping the object in native form, so customers can directly access objects in the object store without going through Komprise. They can also view the object as a file from the original NAS or from Komprise because the metadata is preserved.
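As a rough illustration of the general pattern (not Komprise’s implementation; the bucket name, paths and metadata fields are assumptions), a file can be copied into an S3-compatible object store with its POSIX attributes carried along as user-defined object metadata, so the object stays natively readable by any S3 client while the file context is preserved:

```python
# Hypothetical sketch: tier a file into an S3-compatible object store while
# carrying its POSIX metadata along as user-defined object metadata.
import os
import stat

import boto3

s3 = boto3.client("s3")

def tier_file_to_object(path: str, bucket: str, key: str) -> None:
    st = os.stat(path)
    # The file context travels with the object as x-amz-meta-* headers,
    # so the object is still natively readable by any S3 client.
    file_metadata = {
        "original-path": path,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "uid": str(st.st_uid),
        "gid": str(st.st_gid),
        "mtime": str(int(st.st_mtime)),
    }
    with open(path, "rb") as f:
        s3.put_object(Bucket=bucket, Key=key, Body=f, Metadata=file_metadata)

# Illustrative path, bucket and key
tier_file_to_object("/mnt/nas/projects/report.docx", "cold-tier-bucket", "projects/report.docx")
```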

Should the abstraction layer provide file sharing and collaboration and how might this be done? Would it use local agent software?

Data management can expose data for file sharing and collaboration, but it is better for data management to be a data- and storage-agnostic platform that enables various applications including file sharing, collaboration, data analytics and others. 

By abstracting data across silos, app developers can focus on their file sharing or other apps without worrying about how to bridge silos. Our industry has been moving away from agents because they are difficult to deploy, brittle and error prone.

Could you explain why the concept of object sharing and collaboration, like file sharing and collaboration, makes sense or not?

Object sharing is less about editing the same object and more about making objects available to a new application such as data analytics. This speaks to the lifecycle of data and data types. File data that requires collaboration, such as documents and engineering diagrams, will be accessed and updated frequently and is best served by file access, while longer-term access is best served by object.

For example, an active university research project may create and collect data via file. Once the project is complete, the research director can provide read-only access to the object using a presigned URL, which wouldn’t be possible with a file alone.
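For instance, a minimal sketch assuming an AWS S3 bucket and the boto3 SDK (the bucket and key names are made up), generating a time-limited, read-only presigned URL takes only a few lines:

```python
# Hedged sketch: share read-only, time-limited access to an archived object
# via a presigned URL (bucket and key are illustrative).
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "research-archive", "Key": "project-42/results.tar"},
    ExpiresIn=7 * 24 * 3600,  # the link expires after one week
)
print(url)  # recipients need no credentials, agents or file system mount
```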

If it doesn’t make sense does that mean file and object storage are irrevocably separate silos?

They are separate because the use cases are different, the performance implications are different, and the underlying architectures are different. I would draw a distinction between the data and how it’s accessed and used versus silos. There is value today in providing object access to file data, but perhaps no value in providing file access to data natively created in object. 

An engineering firm creates plans for a building using file-based apps, and after that project is complete the files should not be altered. Therefore, access by object makes sense. On the other hand, image data collected by drones via object API should be immutable through its entire life cycle. Providing access via file would offer limited benefit and be extremely complex given the very high object counts involved.

Should the abstraction layer provide file and object lifecycle management facilities?

Yes. Data management should provide visibility across the silos and move the right data to the right place at the right time systematically. Lifecycle management is critical. Many of the pain points of file data such as space, file count limits and data protection are growing beyond what can be managed effectively by humans. 

Old school storage management largely consisted of monitoring capacity utilization and backups. This was largely reactive: “I am running out of space. Buy more storage.” 

Proactive, policy-based data management can alleviate many of these issues. Object lifecycle management is often about managing budgets: your cloud provider is perfectly happy to keep all your data at the highest performance and cost tier.
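As a concrete illustration of budget-driven object lifecycle management (a sketch assuming AWS S3 and boto3; the bucket name and day thresholds are illustrative, not recommendations), a lifecycle rule can transition aging objects to cheaper storage classes automatically:

```python
# Hedged sketch: an S3 lifecycle rule that moves aging objects to cheaper
# storage classes so cold data does not sit on the highest-cost tier.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="research-archive",  # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```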

Does the responsibility for data protection, including ransomware protection, lie with the abstraction layer or elsewhere?

Enterprise customers already have existing data protection mechanisms in place so the data management layer can provide additional protection, but it must work in concert with the capabilities of the underlying storage and existing backup and security tools. If you require customers to switch from existing technologies to a new data management layer, then it’s disruptive, and again creates a new silo.

Data protection features such as snapshots and backup policy for file, or immutability and multi-zone protection for object, are characteristics of storage systems. Effective data management is putting the right data on the right tier at the right time. A great example is moving or replicating data to immutable storage like AWS S3 Object Lock for ransomware protection; data management puts the data in the bucket and Object Lock provides protection.
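For example (again a hedged sketch assuming AWS S3 and boto3; the bucket, key and retention date are illustrative), writing a copy of the data into a bucket created with Object Lock enabled makes that copy immutable until the retention date passes:

```python
# Hedged sketch: replicate a file into an Object Lock bucket so the copy
# cannot be altered or deleted until the retention date passes.
import datetime

import boto3

s3 = boto3.client("s3")

with open("/mnt/nas/finance/ledger-2023.csv", "rb") as f:  # illustrative source file
    s3.put_object(
        Bucket="immutable-copies",    # bucket must be created with Object Lock enabled
        Key="finance/ledger-2023.csv",
        Body=f,
        ObjectLockMode="COMPLIANCE",  # retention cannot be shortened, even by an admin
        ObjectLockRetainUntilDate=datetime.datetime(2027, 1, 1, tzinfo=datetime.timezone.utc),
    )
```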

How should — and how could — this abstraction layer provide disaster recovery?

Data management solutions can replicate data and provide disaster recovery often for less than half the cost of traditional approaches because they can leverage the right mix of file and object storage with intelligent tiering across both. Ideally, hot data is readily available in the event of a disaster, but costs are lower.

What’s important to note here is the data is not locked in a proprietary backup format. Data is available for analytics in the cloud and its availability can be verified. A big challenge for traditional/legacy approaches to disaster recovery was the need to “exercise” the plan to make sure data could be restored and understand how long that would take. Since data is available natively in the cloud when using storage-agnostic data management solutions, disaster recovery can be verified easily and data can be used at all times.

How can unstructured data be merged into the world of structured and semi-structured analytics tools and practices?

The need to analyze unstructured data is growing with the rise of machine learning, which requires more and more unstructured data. For instance, analyzing caller sentiment for call center automation involves analyzing audio files, which are unstructured. But unstructured data lacks a specific schema, so it’s hard for analytics systems to process it. In addition, unstructured data is hard to find and ingest, as it can easily amount to billions of files and objects strewn across multiple buckets and file stores.

To enable data warehouses and data lakes to process unstructured data, we need data management solutions that can globally index, search and manage tags across silos of file and object stores to find data based on criteria and then ingest it into data lakes with metadata tables that give it semi-structure. Essentially, unstructured data must be optimized for data analytics through curation and pre-processing by data management solutions.
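To illustrate the idea of a global index giving unstructured data semi-structure (a simplified sketch, not Komprise’s product; the paths, bucket name and chosen metadata fields are assumptions), one could crawl a file share and an object bucket into a single metadata table that a data lake engine can then query:

```python
# Hedged sketch: crawl a NAS share and an S3 bucket into one metadata table,
# then write it as Parquet so a data lake engine can query it.
import os

import boto3
import pandas as pd

rows = []

# Index a file silo (illustrative mount point)
for root, _dirs, files in os.walk("/mnt/nas/projects"):
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        rows.append({"silo": "nas", "key": path, "bytes": st.st_size,
                     "modified": st.st_mtime,
                     "extension": os.path.splitext(name)[1].lower()})

# Index an object silo (illustrative bucket name)
s3 = boto3.client("s3")
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="research-archive"):
    for obj in page.get("Contents", []):
        rows.append({"silo": "s3", "key": obj["Key"], "bytes": obj["Size"],
                     "modified": obj["LastModified"].timestamp(),
                     "extension": os.path.splitext(obj["Key"])[1].lower()})

# The table gives the data semi-structure: searchable by silo, size, age or type
index = pd.DataFrame(rows)
index.to_parquet("unstructured_index.parquet")
print(index[index["extension"] == ".wav"].head())  # e.g. audio files for sentiment analysis
```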