The storage analyst Enrico Signoretti popularised the term ‘flash and trash’ in 2015 to “describe a trend of two-tier storage built on latency-sensitive, flash-based arrays on one side and capacity-driven, scale-out systems on the other”. Three years on, is this still a trend, or was it a flash in the pan?
In essence, flash and trash is a two-tier infrastructure, with latency-sensitive apps and storage in the first category and capacity-driven ones in the second. Within that there can be multiple storage media tiers: flash, disk, the public cloud and tape. The choice is driven by optimising capacity, cost and access latency.
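To make the classification concrete, here is a minimal Python sketch of how a two-tier placement decision might look. The 1 ms threshold, tier labels and workload figures are our own illustrative assumptions, not taken from Signoretti’s report.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    latency_target_ms: float  # required access latency
    capacity_tb: float        # expected data footprint

def choose_tier(w: Workload) -> str:
    """Toy two-tier placement: latency-sensitive workloads land on the
    flash tier, capacity-driven ones on the scale-out/object tier.
    The 1 ms threshold is an illustrative assumption."""
    if w.latency_target_ms < 1.0:
        return "tier 1: latency-sensitive (all-flash)"
    return "tier 2: capacity-driven (scale-out disk / cloud / tape)"

for w in (Workload("OLTP database", 0.2, 5),
          Workload("video archive", 50.0, 800)):
    print(w.name, "->", choose_tier(w))
```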
It seems valuable – and commonsense – to have some unstructured data in the primary infrastructure tier. We’ll see how we can use these concepts to classify suppliers and products.
Two-tier data storage infrastructure
Signoretti has developed his thinking in a recent study: Two-tier Data Storage Strategy: A Gigaom Market Landscape Report (cost $299).
According to the abstract, “modern applications can now be divided into two families: latency-sensitive or capacity-driven.”
The publication seemed like a good opportunity to interview Signoretti about his thinking on two-tier storage. Here is our email conversation.
B&F: Why is the cost of traditional infrastructure incompatible with the exponential growth of unstructured data, big data workloads, or Internet of Things (IoT) applications?
Signoretti: It’s just a matter of numbers. The modern storage systems we know, including object storage and scale-out file systems, are all about scale and parallel access through the network. If you have to manage millions of sensors or thousands of video streams concurrently, there is no way you’ll put them in what we used to call tier-1 (scale-up) storage systems.
It is simply unsustainable and unmanageable, from both a cost and an operational standpoint. And it is not a matter of media (flash or disk); it is all about how data is stored, distributed and then accessed.
B&F: You suggest that, with data becoming one of the most important assets for organisations, structured (primary, database, block) and unstructured (secondary, file) data are now equally important, and they should be protected and treated accordingly. Isn’t it true that not all unstructured data is equally important and that the important unstructured data has become primary data?
Signoretti: What I mean is that the primary and secondary storage categorisation is no longer valid; now it is all about the data, which can be primary or secondary. And the classification is based on its value, not its type. There are several solutions on the market that were clearly meant to solve what was secondary storage, but they are now focused on the value of data rather than data protection only. Think of solutions like Cohesity, for example. They do not store data just for short- or long-term backup/archive. They also make it available to other applications, make it searchable and so on.
B&F: So the original primary/secondary classification is still true, albeit with both file and block now in the primary tier?
Signoretti: Yes, you have primary and secondary data, not storage. Are all the databases in your organisation equally important? I’m sure they are not. Therefore they shouldn’t all share the same microsecond-latency storage system.
But, at the same time, you want all of them fully protected and with all the data services you need to simplify your life. The cost of protecting data today is very low, and sub-millisecond latency is for everybody.
And again, did you see how many high-performance storage systems for AI are available now? They are all-flash, scale-out and meant to work with unstructured data. Hard to find a categorisation for them… especially because they are part of strategic projects for these organisations.
It is hard to classify storage by the type of data it stores (structured/unstructured); it is much easier to classify data by its value to your organisation.
B&F: Can you explain why you think a new primary/secondary classification has emerged and is based on the value of data, with data indexation and classification?
Signoretti: Enterprises are storing huge amounts of data but they do not have control over it. Petabytes and petabytes of files are dispersed across the organisation, sometimes globally. If they do not go through an operation of data/metadata augmentation, it is quite impossible to know its real value. They spend money to save and protect it, but it is not reusable.
More tools are now able to augment metadata during ingestion and others can do it on existing repositories. In the end, the goal is to make data searchable and reusable. GDPR and similar regulations are accelerating this kind of process but it goes beyond that. In fact, having the ability to analyse content and come up with insights or create new data sets for analytics applications is invaluable.
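As a rough illustration of metadata augmentation at ingestion, the hypothetical Python sketch below attaches searchable metadata to an object before it is written to a capacity tier. The describe() function stands in for whatever content-analysis service an organisation might use; all names and fields are our assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

def describe(content: bytes) -> dict:
    # Placeholder for a real content-analysis step (image labelling,
    # document classification, PII detection for GDPR, and so on).
    return {"labels": ["example"], "contains_pii": False}

def ingest(key: str, content: bytes, store: dict) -> None:
    """Augment the object with searchable metadata, then store both."""
    metadata = {
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        **describe(content),
    }
    store[key] = {"data": content, "metadata": metadata}

repo: dict = {}
ingest("photos/2018/cat.jpg", b"...image bytes...", repo)
print(json.dumps(repo["photos/2018/cat.jpg"]["metadata"], indent=2))
```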
B&F: Is a two-tier storage structure enough? Don’t we need three or more tiers: fast flash, slow flash plus fast disk, nearline storage, cloud long-term or tape?
Signoretti: Two-tier storage infrastructure could seem an oversimplification, but latency and capacity are the most important metrics to look at anyway. And I am not talking about a two-tier system, but a two-tier infrastructure.
Think about a second tier based on object store, for example. It can include flash, HDD, cloud and tape. All these media tiers are embedded in a single infrastructure and managed as a whole to provide the right mix of data protection, performance and cost depending on the value of data you are storing in it.
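One way to picture a single capacity-tier infrastructure spanning several media is a set of placement rules evaluated per object, as in this illustrative Python sketch; the thresholds and media classes are assumptions, not drawn from any particular product.

```python
def placement(age_days: int, accesses_per_month: int) -> str:
    """Toy placement rules for one object-store infrastructure spanning
    several media classes; all thresholds are illustrative."""
    if accesses_per_month > 100:
        return "flash cache"
    if age_days < 90:
        return "HDD pool"
    if age_days < 365:
        return "public cloud (cool storage class)"
    return "tape / cold archive"

print(placement(age_days=10, accesses_per_month=500))   # flash cache
print(placement(age_days=400, accesses_per_month=0))    # tape / cold archive
```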
B&F: You suggest that modern applications can now be divided into two families: latency-sensitive or capacity-driven. The first group needs data as close as possible to the processing engines (e.g. CPU, GPU, etc.) while the latter usually requires easily accessible data spanning multiple devices across the network. Can you provide examples of such applications please?
Signoretti: Think about AI and image recognition, for example. When you need to train the neural network, latency is everything. You need a large data set, and latency is key to getting the work done quickly. After that you save your pictures in a cloud repository, forever. Images are augmented with a description of the content when you save them, becoming searchable.
In this case the key is capacity, because even if it takes a minute to create the description of your picture, that’s OK. At this point you want to store as many pictures as you like and access them again no matter where you are. Capacity and network access, that’s all, and mobile network latency is way higher than any storage system latency.
B&F: Are any storage suppliers providing products or services spanning both primary (latency-sensitive) and secondary (capacity-optimised) data storage needs or do we need a different set of suppliers providing data management services across both categories?
Signoretti: The number of solutions is growing on a daily basis now. In my latest report I gave just a handful of examples, but the market is evolving quickly and both startups and large vendors are working hard to provide solutions. Most of these take advantage of object storage on the back-end and offer traditional file or block access on the front-end. Data movement is managed by the front-end layer or by an external appliance or service.
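A much-simplified sketch of that pattern: a file front-end keeps hot data locally and demotes cold files to an object back-end, leaving a stub behind. The class names, the days_idle threshold and the stub mechanism are hypothetical; shipping products implement this with far more sophistication.

```python
import time
from pathlib import Path

class ObjectBackend:
    """Stand-in for an object store back-end (e.g. an S3-compatible bucket)."""
    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

def demote_cold_files(root: Path, backend: ObjectBackend,
                      days_idle: int = 30) -> None:
    """Copy files not accessed for `days_idle` days to the object
    back-end and replace them locally with a small stub."""
    cutoff = time.time() - days_idle * 86400
    for path in root.rglob("*"):
        if path.is_file() and path.stat().st_atime < cutoff:
            backend.put(str(path.relative_to(root)), path.read_bytes())
            path.write_text("stub: content demoted to object store\n")

# Example (hypothetical path): demote_cold_files(Path("/mnt/nas/projects"), ObjectBackend())
```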