
Pure Storage Portworx Q&A: Why storage needs a data strategy

Murli Thirumale

Paid Feature Murli Thirumale is VP and general manager of the Cloud Native Business Unit at Portworx, now part of Pure Storage. He told Blocks & Files why organisations need a data strategy – not just a storage strategy.

Blocks & Files: There are lots of companies providing storage for containers, and it seems there are two approaches on the storage front. One is to hook up an external array through the CSI interface, as Dell EMC has recently done. Another is to have cloud-native storage facilities within Kubernetes, where storage is just another type of container – a system container. Mixing virtual machines and containers also seems a sensible thing to do, but it sometimes seems the two are fundamentally incompatible; you can make a stab at putting an abstraction layer across the two of them, but really you need to go all in on one or the other. What is your view on that?

Murli Thirumale: I’d like to take a stab at that question from the customer in, rather than the vendor out. Storage itself is evolving, moving on to data management, and on to a new way of thinking about how enterprises need to win with data.

Let’s step back 20 years or so, and we can see the start of cloudification. The big change was the cloud and moving from capex to opex. But technology always drives changes in the underlying hardware infrastructure. So, not only was the hardware infrastructure being cloudified, it was also being upgraded – you had Cisco Nexus top-of-rack switches, you had HCI happening.

The next phase is when SAP and all of those guys came to the fore, and it was the world of apps. The value changed to apps, and Software-as-a-Service. Today the world of apps and data is about automating these apps, and this has led to containerisation.

Now, people are not just being responsive or competing by going fast. They are competing by being smart with their data – this is the “data is the new oil, data is the new currency” argument.

Being smart is not just a question of using your own data – I am not talking about data mining HDFS data lakes, here. This is about conflating your data with the world’s data. To take one example, one of our customers is a COVID vaccine company, and they were able to do fast data science models. They were comparing publicly available information versus their own private tests. So, it’s conflating the two that allows you to gain insight. Now, let’s think about Uber, which is nothing but conflating a lot of publicly available GIS information with its own private information about the driver and about where I want to go.

So in this enterprise journey, the world of storage and the CIO has moved from thinking about cloudifying infrastructure to automating apps. That is where the puck is now. But mining data, both in real time and in batch, to gain insight – that’s where the puck is going.

How does the world of storage add value and not just turn into those old storage admins working away in the basement of enterprises and never seeing daylight? The reality is the world has moved on to apps, people and DevOps. So how does storage cope in this world? The answer lies in migrating to a data strategy.

So we can’t just be furniture guys concerned about wardrobes, drawers and boxes?

What does storage do? Storage is about storing data and automating data management. And what we do today is about freeing that data from one place and making it multi-cloud and multi-app and all of that. But in the end, you have to actually mine the data itself for insight.

So what is going to happen in the world of storage? At the bottom is, of course, the infrastructure. And nowadays, there is a software-defined storage overlay that has been overtaken by a Kubernetes storage overlay.

Now companies like Portworx or Robin.io – and there’s a host of other people, whether it’s the Cohesitys of the world or others in data management – are about that automation layer. We have taken Kubernetes and we’ve taken that data, freed it from the array, and made it available across the cloud, across containers, across different apps. Data management is the bulk of our business today.

Now when I say data, people think of data as one thing. But in my book, data is actually five things. Data is consumed as a service now, by these applications. So the app tier is at the top, but the first thing is databases, because databases are how data is stored. The second thing is data search – Elasticsearch, in particular. And I’m going to talk about these as services, because that’s really where the world is at. The third thing is analytics – that can be an Excel spreadsheet, but it could be Tableau and the old-style analytics tools. And then there is AI and ML, which is unique. Why? Because it requires a different parsing of the data – it’s really GPU-based, TensorFlow, those types of things. Then finally, streaming, in which I include messaging. I would merge those two boxes – streaming is really about having distributed data right out there: IoT, sensors of different kinds, and the like.

So these are the five data services. And this is actually the whole array of modern app solutions. It’s MongoDB, it’s Elastic, it’s Cassandra, it’s Kafka and Spark. This is not the old-style siloed world of Oracle and Sybase – which still exists. But this is the new world, where infrastructure is cloudified. Data is now all running on containerised apps. That’s the cloud-native world. But in addition, data is being consumed as a service in these five different sub-segments.

It looks like a stack. And it looks like the traditional place for suppliers like Pure is at the bottom, but that Pure is moving up the stack to provide services there. And the implication I’m drawing from what you’re saying, is that Pure Portworx will move even further up?

Pure is going to be in all these layers. These things are not mutually exclusive. And in fact, Portworx is an example of how we’re actually stitching these together. And in the future, there’s no reason why we couldn’t have a vertical slice that goes all the way, and even ties in the app as well.

But you would probably do that with partnerships, wouldn’t you? Because of the amount of code you have to write doing that?

Exactly. So this is what I think a CIO needs to do from an industry viewpoint. But we’re not doing this on our own. This is not such a secret anymore, but people think of Kubernetes as being the container orchestrator. And they’re right, that is the primary role of Kubernetes.

But now, I believe there’s the second coming of Kubernetes, and this is really as an infrastructure control plane. It’s a multi-cloud infrastructure control plane. Kubernetes is orchestrating networking with the help of CNI. That’s what [Tigera’s Project Calico] does. It’s also using the CSI extensions of Kubernetes to orchestrate storage. That’s what Portworx or StorageOS [now Ondat] or Robin.io do. And then it will also be orchestrating VMs in the future, using KubeVirt, which is an emerging technology that is gaining some currency. It’s still a technical concept, but I think more and more you will see compute being orchestrated by Kubernetes.

That’s astounding. I’ve been with you up until now, but the idea of compute being organised by Kubernetes …

Well, there is this CNCF-incubated technology called KubeVirt and it’s basically a way to orchestrate VMs using Kubernetes. You stand up VMs and then you can manage them just like you would containers; but now, instead of containers being orchestrated, you’re instantiating VMs and doing things like moving VMs and moving containers within VMs.

This is still in its infancy, but I think it’s going to happen. And this may sound a little bold, but I would say Kubernetes is really going to replace the vision of what OpenStack was intended to do. OpenStack was going to be this abstraction layer that allowed people to manage across any infrastructure, their storage, networking and compute. In storage it was Cinder and Swift, and so on.

My view is, it was so complicated and poorly done that it kind of crumbled. Of course, there are probably 150 companies using OpenStack that still swear by it. But these were mostly people who put a lot of effort into developing the standard. But in reality, the era of OpenStack is over. OpenStack was intended to be the universal way to manage infrastructure, and that’s what is happening now in a multi-cloud way with Kubernetes, with extensions to Kubernetes, called CNI, CSI, and then KubeVirt.
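To make the KubeVirt idea concrete, here is a minimal sketch – assuming KubeVirt is installed in the target cluster and a kubeconfig is available – of compute being driven through the Kubernetes API: it uses the Python Kubernetes client to create a KubeVirt VirtualMachine custom resource. The VM name, namespace and container disk image are illustrative.

from kubernetes import client, config

# Load local kubeconfig and talk to the cluster's API server.
config.load_kube_config()
api = client.CustomObjectsApi()

# A minimal KubeVirt VirtualMachine definition (field layout per KubeVirt docs).
vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "demo-vm", "namespace": "default"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "1Gi"}},
                },
                "volumes": [{
                    "name": "rootdisk",
                    "containerDisk": {"image": "quay.io/kubevirt/cirros-container-disk-demo"},
                }],
            }
        },
    },
}

# The same API server and scheduler that place containers now place a VM.
api.create_namespaced_custom_object(
    group="kubevirt.io", version="v1",
    namespace="default", plural="virtualmachines", body=vm,
)

Once created, the VM is listed, deleted and scheduled with the same tooling as any other Kubernetes object – which is the point Thirumale is making about Kubernetes as an infrastructure control plane.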

Do you see Kubernetes getting involved in composing datacentre IT resources?

Exactly. The old world is a machine-defined world. That’s how VMware was when the focus was on infrastructure. But now the focus is on as-a-service. Forget infrastructure – people want to consume services. So how do you shorten the path to something we consume as a service? You orchestrate it with containers and Kubernetes.

Look at Portworx, which is an amazing example of this. Our buyer is not the storage admin; our buyer is a DevOps person. And eventually, with Portworx Data Services, our buyer is going to be a line-of-business person.

Because you’re not supplying hardware boxes or software – you’re supplying services?

Yes, they’re consuming a service. Kubernetes was conceived as an app-organising framework, so it is naturally already set up to be app-oriented, but it’s also about consumption. So this is data services as code. You had infrastructure as code, software as code; now you have as-a-service as code.

But you don’t have to go there. Pure could remember it is a storage company. What’s in it for Pure to move up the stack to this as-a-service control plane and provide service-level applications up there?

I’m not saying we’ve left our data management world behind. That’s the bulk of our revenue as Portworx and it’s growing. But this is a brand new thing we launched in September and it’s called Portworx Data Services. Basically it’s a one-click way to deploy data services. Think of this as a curated set of data services, and over the next year there will be 12 to 14 of those.

Our analysis has revealed that these data services are probably about 75 to 80 per cent of what is being deployed out there in the modern app world. It’s not about siloed infrastructure stovepipes – this is the modern multi-cloud world. And what we will offer is essentially a one-click way to do it.

On day one, we’ll let you deploy them with a single click. We have curated operators that allow database sizing, so we’ll do basic database sizing, starting with defaults we’ve learned over time from our experience. And you can just download it – it’s a containerised version of Couchbase, or a containerised version of Cassandra. We will have an open source version initially, but in future we might also have partnership licences from the vendors.

You won’t be providing the equivalent of Couchbase or Redis, or Kafka yourself? What you’re providing is the facility for consuming them as a service?

Yes, this is a database-as-a-service platform. If I were to be grandiose, I would say it’s like an app store for databases. When I go on my phone, and I go click on the app store, Apple just provides me a way to get Facebook or to get Google Maps. So, remember the old walled garden phrase? This is kind of a walled garden for data services.

But we’re doing more. We’re not just providing you the ability to provision it. That’s the day one part. But we will now allow you to optimise deploying it on a multi-tenant infrastructure. One of the challenges we found is that people might understand how to run Redis, but they won’t know how to pick the instance size to get the IOPS optimised. And they sure as hell have no idea what to do when a container fails and they have to move to a different cloud or how to migrate it.

And then day three is backing it up and archiving it right through the lifecycle. So what Portworx Data Services is really doing is using Kubernetes in its new avatar as a service manager. Underneath the covers, a line-of-business person does not care that it’s Kubernetes. They may not even know – the point here is that Kubernetes becomes invisible.

We’re not going to them and saying “Kubernetes, Kubernetes!” We’re just saying to them, you can consume a Postgres endpoint, consume a Redis endpoint, here’s Elasticsearch as a service. So our customer is really still a DevOps person, but one who is now going around offering these five data services to their line-of-business customers as a self-service model.

Sponsored by Pure Storage.

Veritas chasing hyper-automation with NetBackup 10

Veritas has launched v10.0 of its NetBackup data protection product along with an Autonomous Data Management strategy.

Veritas says it is planning for a future where its technology can provision, optimise and repair data management services autonomously, with users self-servicing data protection and recovery. This will be based on so-called Cloud Scale Technology, which harnesses artificial intelligence (AI) and hyper-automation – Veritas’s term – to self-provision, self-optimise, and self-heal in web-scale multi-cloud environments.

CEO Greg Hughes issued a statement: “Hackers are increasing the impact of their ransomware attacks by targeting cloud services and data. Veritas is laying out its strategy for how we solve that challenge for our customers, starting with tools, available today, that will help to reduce cloud footprint and costs, keep data safe from ransomware, and pave the way to Autonomous Data Management.”

We are told Cloud Scale Technology enables a containerised, programmable, and AI-powered microservices architecture that provides autonomous unified data management services across any cloud.

Coming a year after v9.0, NetBackup v10 features:

  • Enhanced multi-cloud storage and orchestrated tiering capabilities, including deep support for Amazon Web Services and expanded support for Microsoft Azure, to reduce the cost of backup storage by up to 95 per cent (no comparison base supplied). NetBackup 10 supports all major Kubernetes distributions and provides multi-cloud cross-platform recovery. 
  • Users can recover the data they want to any Kubernetes distribution. It also has new automated detection and protection for more platform-as-a-service workloads, including Apache Cassandra, the Kubernetes distributions and Microsoft Azure Managed SQL and Azure SQL.
  • Its deduplication capabilities have been upgraded (no details) and it uses elastic multi-cloud compute services to reduce costs.
  • V10 provides automatic malware scanning during backups and prior to restores to ensure infection-free recovery of data. Its AI-driven anomaly detection can automatically initiate malware scanning.
  • NetBackup SaaS Protection is now integrated with NetBackup 10 to provide a single-pane-of-glass view of a customer’s data protection estate for governance and compliance purposes.
  • V10 includes a new integrated no-cost base version of NetBackup IT Analytics, formerly known as Veritas APTARE, to provide AI-driven analytics and reporting.
  • V10 supports Azure Blob object storage as well as S3.

A Conquer Every Cloud micro-website contains more information.

Veritas’s strategy is to respond to customers’ growing use of more applications in multiple cloud environments and of more cloud-native applications and SaaS apps by extending NetBackup’s data protection services across these environments and so provide a single comprehensive backup capability. This will be easier to manage because it will have a degree of autonomy – with AI-triggered malware scanning, an IT estate-wide protection view, and self-scaling and -healing features.

Private equity-owned Veritas has to show that it can sustain itself and grow against competition from Commvault, Veeam, Cohesity, Druva, HYCU and Rubrik. V10 should help it retain its existing customers and even gain some new ones, but it won’t enable Veritas to damage its competitors much. The name of the game is preventing inroads, not invading and conquering.

Merkle Tree

Merkle Tree – a hierarchy of blocks starting from data blocks at the bottom and progressing upwards to the root through a set of hash blocks. A hash is a fingerprint of the data from which it is computed and is, for practical purposes, unique. At the bottom layer, every data block has its hash computed and placed in a corresponding hash block (a leaf). Each hash block at the next level up links to two hash blocks at the level below, and its value is the hash of the concatenation of those two child hashes. This continues up the tree until a single root block remains, whose value is computed in the same way from the hashes of the blocks one level down. The root hash is effectively a fingerprint of the entire data set. In a file-by-file comparison, the validity of the data in the data blocks can be verified by comparing the root hashes of the two files.

If a data block is altered or deleted then its hash value will change or vanish, and this can be detected by checking one or more of the hashes above it in the tree, or the root. Merkle trees are an underlying component of blockchains, including Bitcoin. Apache Cassandra uses Merkle trees to detect inconsistencies between replicas of entire databases.
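As a minimal sketch of the idea (not how Cassandra or Bitcoin implement it), the following Python computes a Merkle root over a list of data blocks using SHA-256; the padding rule for an odd number of leaves is an illustrative choice.

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list) -> bytes:
    # Leaf level: hash every data block.
    level = [sha256(b) for b in blocks]
    if not level:
        return sha256(b"")
    # Repeatedly hash pairs of child hashes until one root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash when the count is odd
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
print(merkle_root(blocks).hex())

# Changing a single block changes the root, which is how tampering or
# replica drift is detected without comparing every block.
tampered = [b"block-0", b"block-X", b"block-2", b"block-3"]
print(merkle_root(tampered).hex())  # differs from the first root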

Wasabi scores Liverpool Football Club goal

LIVERPOOL, ENGLAND - Thursday, February 10, 2022: Wasabi branding during the FA Premier League match between Liverpool FC and Leicester City FC at Anfield. (Pic by David Rawcliffe/Propaganda)

Liverpool Football Club is storing data in Wasabi’s cloud as Wasabi becomes its Chief Storage Partner.

Wasabi sells object-based Hot Cloud Storage, launched in May 2017. Last November it signed up as a sponsor for the Fenway Bowl – an annual college football bowl game at Fenway Park, Boston, with teams from the American Athletic Conference (AAC) and Atlantic Coast Conference (ACC). This Liverpool FC deal is a roughly equivalent sponsorship outside the USA and Wasabi is hoping that a sponsorship deal with such a globally famous club will enhance its competitive branding status against the three public cloud titans.

Drew Crisp, SVP of Digital at Liverpool FC, said as much in his announcement quote: “As Wasabi looks to continue its exponential growth and provide a highly competitive cloud storage alternative in this rapidly expanding market, it wanted to partner with a brand that has huge global reach and could provide the platform to drive brand awareness and engagement. With LFC’s huge global audience and marketing capabilities, we’re delighted to provide the brand platform that Wasabi needs.”

Michael Welts, chief marketing officer, Wasabi Technologies, followed suit: “This is an historic moment in the evolution of the Wasabi brand. Not only does Liverpool Football Club have a fanbase and global reach that is second to none, but the club is also driven to be the technology trailblazer in an increasingly data-driven sports world. Wasabi’s work with LFC will be a game-changer in the trajectory of our company, and we look forward to creating many memorable and exciting opportunities alongside the biggest sports brand in the world.”

Wasabi customers and prospects can look forward to sales pitches in the exec suites at Liverpool FC’s stadium on match days.

Wasabi Technologies was founded in 2015 and has taken in around $275 million in funding. Its main product marketing pitch is that its cloud storage is 1/5th the price of AWS with no complex tiers or unpredictable egress fees. 

Nvidia plus Excelero equals acquisition – maybe

We have heard that Nvidia could be buying Excelero, an NVMe storage software startup – and Excelero did not deny it.

Nvidia is the GPU server giant that just withdrew its attempt to buy Arm. Excelero makes and sells NVMesh software which provides a virtual, distributed flash array supporting converged (in-server SSDs) and disaggregated (all-flash array) architectures.

When we asked about the acquisition rumor, Excelero CEO Yaniv Romem responded: “Decline to comment.”

Excelero was started up in Israel in 2014 and has raised more than $40 million in funding – possibly $45 million – from VCs and other investors including Micron, Qualcomm and Western Digital. Mellanox put in an undisclosed amount in May 2019, shortly after Nvidia agreed to acquire it in March that year. Excelero’s founders were CEO Yaniv Romem, Chief Scientist Omri Mann and Engineering VP Ofer Oshri.

Israel’s The Marker media outlet suggests the acquisition price is less than the total funding amount. If true, that would be a disappointing exit for Excelero’s founders and backers. Nvidia declined to comment on The Marker‘s questions.

E8, a similar NVMe storage startup led by Israelis, was bought by Amazon in July 2019 for between $50 million and $60 million.

By buying Excelero, Nvidia would gain NVMe storage software technology which it could use in or alongside its GPUDirect host server CPU/DRAM bypass software to bring data from external servers and arrays to its GPUs, and also its BlueField SmartNICs.

Amp It Up: Snowflake boss pens hands-on CEO toolkit

Frank Slootman has written another book about how to run a business based on his time at Data Domain, ServiceNow, and Snowflake.

Our first thought was “not again” – he co-wrote Rise of The Data Cloud last year.

But this was quickly set aside because Frank appears to walk the walk. He’s had stints at Data Domain, bought by EMC for $2.4 billion in 2009; ServiceNow, which ran a $210 million IPO in 2012; and his current Snowflake gig following its IPO last year, which raised $3.36 billion at a valuation of $33.6 billion. If you’re a VC, CxO, EVP, SVP or VP-level exec at a startup that’s failing to reach its full potential, Slootman might be someone you look up to.

He writes in a way that grabs your attention, with chapter titles like “My Journey from Teenage Toilet Cleaner to Serial CEO” and clearly written anecdotes to illustrate his points.

The book is organised in five sections: Raise your standards; Align your people and culture; Sharpen your focus; Pick up the pace; and Transform your strategy.

In each, Frank makes it all seem simple. Diagnose and define the problem, then fix it with celerity. He’s like a corporate plumber. Blocked pipes? HR recruiting flatlining sales people instead of gunslingers? Leaking faucets? Finance starving sales of investment? Inadequate header tank? Too many passengers and not enough drivers in the organisation? Diagnose and fix.

Hitting targets

Throughout the book Frank lobs hand grenades at his business bêtes noires – such as consultants who borrow your watch, charge you money for telling the time, and then keep your watch. Ah, the old ones are the best ones.

Newer objects of his ire are fun to read too, such as customer success departments which he found at both ServiceNow and Snowflake: “They were happy to follow the trend set by other companies like ours. But not me. I pulled the plug on these customer success departments in both companies, reassigning the staff back to the departments where their expertise fit best.”

VPs of strategy are another target: “In many large companies, it’s common to see dedicated strategy roles, often at the VP level. These people are basically in-house consultants because they have no operational responsibility.” For Frank, “operators in charge of each business unit must also be the strategists for their business, and the chief executive officer must also act as the chief strategy officer.”

Notable quotes

“But the simple reality is that not all businesses are destined to succeed … Slow-growing companies become the walking dead.”

“Silicon Valley is littered with companies lingering in the proverbial chasm for years and years.” (Shades of extreme inbetweeners here?)

“When it costs much more than a dollar to generate a dollar, you don’t really have a business.”

Data Domain: “The sale to EMC was a great outcome by any economic standard, but a CEO can’t help thinking he or she aborted the mission when a company gets sold.”

ServiceNow: “This was a platform, not a tool. A tool is a one-trick pony, but a platform is broadly capable of many different uses.”

Snowflake: “Snowflake was massively over-resourced. … My introduction of a much more serious and disciplined approach was like pushing the staff into a cold shower to wake them up.”

Slootman is a hands-on guy and this book is a highly readable and enjoyable tour through his hands-on CEO toolkit. It’s a pity that Wiley didn’t amp up its production quality, though – a humdrum cover makes a second-rate impression.

The book was written with the help of Will Weisser and is on sale at major retail and online book stores. 

Veeam Data Protection Trends: Assume you will be breached

Veeam’s 2022 Data Protection Trends report says that users now operate in a hybrid multi-cloud world, face rabid ransomware attacks, and need more data protected – and to have that data recovered more quickly, increasingly by SaaS processes and from the cloud.

The two main strands this report brings out for us are that, first, data protection is becoming a combination of backup and security, and second, the public cloud enables more economic and flexible protection.

The findings are the result of a research firm surveying more than 3,000 IT decision makers and IT professionals about their data protection drivers and strategies heading into 2022. The report has a large and worldwide enterprise bias as most respondents were from 1,000-plus employee organisations across 28 different countries.

There were three highlights picked out by Veeam, the first being that respondents ran their IT in a hybrid and multi-cloud (hyperscale and MSP) environment. 

Question: What do you estimate is your organisation’s percentage of servers in each format currently and what do you anticipate the percentage will be in two years’ time?

We can see from Veeam’s chart above that the relative percentage of on-premises physical servers, virtual machines, and hosted virtual machines is stabilising with an approximate 50:50 split between on-premises and hosted systems.

The researchers asked respondents if they thought their data protection SLAs were good enough and if enough data was being protected at the right frequency. Virtually all respondents believed there was an “availability gap” between the SLAs expected and how quickly IT can return to productivity, and also a “protection gap” between how much data they can afford to lose and how often data is protected.

The third highlighted item is that the difference between high-priority data and the rest is narrowing.

Just over half the respondents suffered a ransomware-caused outage, with cyberattacks causing the most outages, and only 36 per cent of data was recoverable after a ransomware attack.

We thought the finding that “4 of 5 organizations do expect to use either cloud-storage or a managed backup service – i.e. Backup as a Service (BaaS) – within the next two years” was interesting, as was the rise of cloud-based disaster recovery.

This chart from page 8 of the report shows a clear rise in cloud-based disaster recovery is anticipated

The underlying messages here are that enterprises need more data protection and know it, and that SaaS approaches and the public cloud will play a greater role. The 28-page report is an interesting read for everyone and not just a piece of sales collateral for Veeam products. Prepare your data protection defences. Assume they will be breached and prepare and rehearse your cyberattack recovery processes so you can get up and running quickly after an attack.

Storage news ticker – 21 February

Consumer and enterprise cloud file sharing business Dropbox reported Q4 2021 revenues of $565.5 million, up 12.2 per cent year-on-year, and with a net profit of $124.6 million compared to a net loss of $345.8 million a year ago. Full-year revenues were $2.16 billion, up 12.7 per cent, with a $335.8 million profit – quite a turnaround from the year-ago $256.3 million loss. Total operating expenses in Q4 were $379.5 million in contrast to the year-ago $744.9 million. The drop was due to a change in real estate assets as a result of its decision to shift to a Virtual First work model. It was Dropbox’s first full year of (GAAP) profitability.

The International Space Station is storing Celestium – a digital artwork by Cecilie Waagner Falkenstrøm and her tech team at ARTificial Mind, who say it is a first-of-its-kind artificial intelligence (AI) artwork created in space. Celestium is an AI and blockchain artwork and part of a global STEAM initiative led by startup Lonestar, open-source leader Canonical, and leading space mission integrator Redwire Corporation.

Celestium

The whole thing is quite insane. We’re told that, by harnessing data from cosmic radiation, the AI algorithm is seeding and mining thousands of space images and space tokens to be distributed on a blockchain. Drawing upon images from NASA’s deep space photographs, the unique works of astronaut Nicole Stott of the Space for Art Foundation, and western artist Tamara King, each image is generated by the AI as one of a kind and embedded as a non-fungible token.

Microchip Technology has announced that the market’s first Gen-4 automotive-qualified PCIe switches are in production. These Switchtec PFX, PSX, and PAX switches provide compute interconnect capabilities for Advanced Driver-Assistance Systems (ADAS). It says PCIe is emerging as the preferred compute interconnect solution for the automotive industry because it provides ultra-low latency and low-power bandwidth scalability to CPUs and specialised accelerator devices. Microchip’s ChipLink diagnostic Graphical User Interface (GUI) provides debug, diagnostics, configuration, and forensics tools for rapid deployment.

An old TNW story from May 2012 surfaced on Twitter and recounted how a 400-strong team-wide permissions policy caused Pixar’s Toy Story 2 production files to be deleted in 1998. Then the animators found their LTO backups didn’t work properly – the log file that would have warned that the tape was full was itself on the full tape and contained zero bytes! Eventually most of the files were recovered from an offsite workstation used by a supervising technical director for work at home. The moral is: test your backup files regularly.

TerraMaster T9-423

China’s TerraMaster has a new nine-bay T9-423 NAS system in tower format designed for SMB customers. It has a 2.0GHz (2.9GHz boost) Intel Celeron N5105 quad-core processor with integrated GPU, 8GB of DDR4 dual-channel memory expandable up to 32GB, and two 2.5GbE ports with support for Link Aggregation, delivering speeds of up to 283MB/sec. It has three fans to keep the storage disks cool and performing optimally, and it operates quietly even at full load, TerraMaster says. It can be used for file storage servers, mail servers, web servers, FTP servers, virtual machine hosts, and other enterprise-level applications.

Streaming fast: Druid’s event-based database practices for sub-second trillion row response

Druidic Wickerman (Wikipedia public domain image: https://en.wikipedia.org/wiki/Druid#/media/File:The_Wicker_Man_of_the_Druids_crop.jpg)

Analysis: Netflix uses an open-source Druid analytics database to understand and quantify how users’ devices are handling browsing and playback. Other Druid users include Airbnb, Alibaba, Cisco, eBay, Lyft, Paypal, Pinterest, Reddit, Salesforce, Shopify, Slack, Splunk, Tencent, Trustpilot, Walmart, Wikipedia, Yahoo, Xiaomi and many more.  

The streaming platform uses it to get real-time answers from queries to a database with billions of items to filter and scan. It’s not a relational database, a NoSQL database nor a data warehouse or lakehouse and, although it’s specialised, it is in wide use. Articles about Ocient and Firebolt have touched on this topic of real-time querying of vast datasets, so it’s becoming important in our storage arena. Let’s try to understand the Druid magic to get a better handle on things.

Origin

An adtech firm called Metamarkets originally devised Druid as a distributed and real-time data store in 2011 to provide SaaS analytics. Metamarkets provided interactive analytic dashboard facilities for advertisers to find out how programmatic ad campaigns were performing. It was acquired by Snap in 2017. Druid was moved to an Apache license in 2015.

A Metamarkets blog describes its original needs: “Our requirements were driven by our online advertising customers who have data volumes often upwards of hundreds of billions of events per month, and need highly interactive queries on the latest data as well as an ability to arbitrarily filter across any dimension – with data sets that contain 30 dimensions or more.  For example, a typical query might be ‘find me how many advertisements were seen by female executives, aged 35 to 44, from the US, UK, and Canada, reading sports blogs on weekends’.”

It needed real-time access to a store containing billions of time-series events. Data warehouses were ruled out because their scan rates were too slow, and speeding them up with caching was defeated because Metamarkets needed to make arbitrary drilldowns. Relational databases were also unsuitable: “RDBMS data updates are inherently batch, updates made via inserts lead to locking of rows for queries.”

A NoSQL approach was no good as “pre-aggregations required hours of processing time for just millions of records (on a ~10-node Hadoop cluster). … as the number of dimensions increased, the aggregation and processing time increased exponentially, exceeding 24 hours.” 

That meant no real-time queries.

So Metamarkets decided to build its own data store and make it hugely scalable by parallelising processes within it. It featured a distributed architecture, real-time data ingestion, the ability to query both real-time and historical data, and a column orientation with data compression for scan speed. Speed was also facilitated by using bitmap indices with so-called Concise Compression.
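As a rough illustration of why bitmap indices help with scan speed (a toy sketch, not Druid’s CONCISE-compressed implementation), each dimension value keeps a bitmap of the rows containing it, and a filter becomes a fast bitwise operation:

# Toy bitmap index: one Python int per dimension value; bit i set means row i matches.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "UK", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "CA", "device": "mobile"},
]

def build_index(rows, dimension):
    index = {}
    for i, row in enumerate(rows):
        index[row[dimension]] = index.get(row[dimension], 0) | (1 << i)
    return index

country_idx = build_index(rows, "country")
device_idx = build_index(rows, "device")

# "country = US AND device = desktop" is a single bitwise AND over two bitmaps.
matches = country_idx["US"] & device_idx["desktop"]
print([i for i in range(len(rows)) if matches & (1 << i)])  # [2]

Real systems compress these bitmaps (Druid uses CONCISE/Roaring-style schemes) so that even high-cardinality dimensions can be filtered without touching the raw rows.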

Druid can be scaled out to perform scans of 26 billion rows per second or more and it can ingest up to 10,000 records per second per node. The servers in a Druid cluster can be scaled up or out; scaling out provides more capacity than scaling up.

Druid structure

A Druid deployment uses commercially available servers, which are called nodes. There are three basic types: master node, data node and query node. Older data is kept in so-called deep storage – pluggable storage outside Druid, such as Ceph, NFS, Hadoop (HDFS) or S3. Each node type runs a set of processes, and those processes can also run on dedicated servers, which are then themselves called nodes – or servers:

  • Master Node – Overlord, Coordinator, Zookeeper, Metadata Store processes;
  • Data Node – Middle Manager/Indexer, Historical processes;
  • Query Node – Broker processes and web console.

We have seen several different diagrams showing how these nodes and processes are related to each other as data is ingested, indexed, sent to (deep) storage, and queried. We’ll show four and then attempt an explanation.

 

Netflix Druid diagram
Knoldus blog Druid diagram

Apache Foundation Druid diagram, which forsakes the “node” term in favour of “servers”
Medium Druid diagram

As you see, though the diagrams share terms like Overlord and Zookeeper, they are all different.

Data ingest

Let’s try to make sense of what’s going on, and start with data ingest which happens in the data node or server layer.

Streaming data from Kafka, Kinesis, Tranquility or other sources can be ingested, as can batched data from Hadoop, S3, an RDBMS and, again, other sources. The Middle Manager nodes, sometimes called Ingestors or Real-Time nodes, handle this and segment or partition the data by time (a configurable duration) into immutable, compressed columnar files which are written to deep storage.

There are three possible column types: timestamp, dimension (field name) or metric (numeric data derived from other fields). The Middle Managers also index and pre-aggregate the data before creating segments. All incoming data within a specified time interval is written to the same segment in sequential fashion. Druid uses a flat data model, and not a nested data model.

Once the data is in deep storage then Historical nodes (Historicals) are responsible for fetching it in response to queries.
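For a flavour of how streaming ingest is set up – a sketch only, since exact field names vary between Druid versions and the endpoint address, topic and column names here are illustrative – a Kafka “supervisor spec” is posted to the Overlord, which then schedules ingestion tasks on the Middle Managers:

import json
import requests

OVERLORD = "http://localhost:8081"  # hypothetical Overlord (master node) address

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "clickstream",
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "inputFormat": {"type": "json"},
            "useEarliestOffset": True,
        },
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "device", "page"]},
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# Submitting the spec asks the Overlord to run Kafka ingestion tasks on the Middle Managers,
# which build time-partitioned columnar segments and publish them to deep storage.
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/supervisor",
                     data=json.dumps(supervisor_spec),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)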

Master nodes

The master nodes form a Druid cluster control plane. The Overlords do data load balancing over the Middle Managers. The Coordinators do data load balancing across the Historical processes or nodes, informing them when data segments need to be loaded from deep storage, evicting segments when no longer needed and getting segments replicated.

Metadata storage contains all the metadata created by the Middle Managers in a relational database such as MySQL or PostgreSQL. The Zookeeper process is an Apache project used for current cluster state management, internal service discovery, coordination and cluster leader election.

We need to understand that the Coordinators are told, via the Metadata Store, when a Middle Manager has created a new data segment and published it to deep storage. The Coordinator then assigns that segment to a Historical, which loads it onto its local filesystem and serves it, and at that point the segment is dropped from the Middle Manager. Until a segment is published, the Middle Managers keep its pre-aggregated data in memory and serve real-time queries from it. Once in deep storage, segments are read-only.

If a Middle Manager node fails, then Zookeeper coordinates having its dataset reconstituted by the remaining Middle Manager nodes.

Query nodes and query process

There is a single type of query node, called a Broker, and a web console is also located in this layer of Druid. Brokers can be scaled out to increase real-time query processing capability.

Brokers receive incoming queries, in a JSON query language or via Druid SQL, from client applications and assign them to relevant Middle Managers and Historicals – the ones which contain (can serve) these data segments. The Broker splits the incoming query up into sub-queries and sends them to the identified Middle Managers and Historicals for execution.

The Historicals fetch the requested data from deep storage while the Middle Managers fetch any real-time data segments.

When the results are returned they are merged and output to the client.
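To see what this looks like from a client’s point of view, here is a minimal sketch that sends a Druid SQL query to the Broker’s SQL endpoint over HTTP; the host, port and datasource name are illustrative, and production clusters usually front the Brokers with a Router.

import requests

BROKER = "http://localhost:8082"  # hypothetical Broker address

query = {
    "query": """
        SELECT country, COUNT(*) AS events
        FROM clickstream
        WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
        GROUP BY country
        ORDER BY events DESC
        LIMIT 10
    """
}

# The Broker splits this into sub-queries for the relevant Middle Managers
# (real-time segments) and Historicals (deep-storage segments), merges their
# results and returns JSON rows.
resp = requests.post(f"{BROKER}/druid/v2/sql", json=query)
for row in resp.json():
    print(row)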

The net result of this is that Druid can enable querying of enormously large data sets which can include real-time as well as historical data. It began life as an adtech real-time query/analytics database facility but is finding use in applications needing a similar ability to query massive datasets constantly ingesting streamed data from event logs in real time. For example, clickstream analysis, APM (Application Performance Management), supply chain, network telemetry, digital marketing, and risk/fraud analysis.

As Metamarkets states, “Queries over billions of rows, that previously took minutes or hours to run, can now be investigated directly with sub-second response times.” Druid can cope with trillions of rows of data in multi-petabyte data stores.

I feel Druid and Druid-type applications are going to become more widely used as streams of digital event data are analysed to optimise and fine-tune enterprise processes involving billions, if not trillions, of events over time.

Druid is available for download on GitHub, and more information can be found on the Druid project website.

RAG

RAG (Retrieval Augmented Generation) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources. An Nvidia blog says RAG “fills a gap in how LLMs (Large Language Models) work. Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences … [This] makes LLMs useful in responding to general prompts … However, it does not serve users who want a deeper dive into a current or more specific topic.”

RAG is used to link generative AI services to external resources, and: “the technique can help models clear up ambiguity in a user query. It also reduces the possibility a model will make a wrong guess, a phenomenon sometimes called hallucination.”
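As a toy sketch of the retrieve-then-generate flow – the embedding and generation functions below are placeholders standing in for a real embedding model and LLM, not any particular vendor’s implementation:

import numpy as np

# Placeholder stand-ins for a real embedding model and a real LLM.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    return f"[LLM answer conditioned on]\n{prompt}"

# 1. Index: embed the external documents once and keep the vectors.
documents = [
    "NetBackup 10 adds AI-driven anomaly detection and malware scanning.",
    "Druid is a real-time analytics database used by Netflix and others.",
    "Wasabi sells hot cloud object storage priced below the hyperscalers.",
]
doc_vectors = np.stack([embed(d) for d in documents])

# 2. Retrieve: find the documents most similar to the user query.
query = "Which database does Netflix use for real-time analytics?"
scores = doc_vectors @ embed(query)  # cosine similarity, since vectors are unit length
top = [documents[i] for i in np.argsort(scores)[::-1][:2]]

# 3. Augment and generate: prepend the retrieved facts to the prompt.
prompt = "Answer using only these sources:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(generate(prompt))

Grounding the prompt in retrieved text is what reduces the chance of the model guessing – the “hallucination” problem the Nvidia blog refers to.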

Embarking on your digital transformation journey? Make sure you fix your data first

HPE GreenLake

Paid Feature Data is the lifeblood of the modern organisation, essential for fuelling higher-level functions like data-driven analytics and AI. And, just as with humans, if the systems that maintain, refresh and circulate that lifeblood are neglected, the outcomes are… well, not great.

Consider recent research by ESG for HPE, which shows that “data first” organisations – those with the most mature data strategies – are typically 20 times more likely to beat their competitors to market, and more than 11 times more likely to beat their revenue goals by 10 per cent or more. Data first organisations also have much happier data analysts and scientists.

So what defines a “data first” organisation? They’re the ones who have confronted the complexity of data management – freeing them up to actually do something more useful with their data. Over 93 per cent of respondents to ESG’s research see complexity holding up digital transformation, with 59 per cent saying complexity is “trending in the wrong direction”. The result is that three quarters of organisations are unable to keep pace with business requirements.

Typical warning signs for data complexity include running multiple storage platforms, each with their own management tools, which becomes a progressively tougher problem at scale. Likewise, companies find they need myriad complex provisioning tools, and face ongoing problems in meeting service level agreements for their applications. And when it comes to addressing these problems, they have to navigate glacial capex procurement processes and – at a time of supply chain snarl ups – drawn out delivery schedules.

This firefighting might sound like business as usual to many tech pros. But the fact is, says HPE Storage’s Worldwide Product Marketing Manager, Simon Watkins, “they’re spending too much time managing the infrastructure, as opposed to really leveraging the data.”

The public cloud at first glance might appear to be the perfect solution. Cloud promises all the infrastructure you want or need, with straightforward, predictable pricing. That same ESG research points to how IT leaders see the problem playing out, with more than nine in ten agreeing that “mature cloud operations on-premises is the single most important step to eliminate complexity”.

It is certainly the case that LOB (lines of business) staff and app owners as well as IT leaders and admins appreciate the simplicity of managing hardware in the cloud. Updates and patching are no longer their concern, provisioning is theoretically much more straightforward, and capacity is unconstrained. And, once the initial transition is made, procurement becomes much less problematic. No big ticket capex cycles, no more supply chain delays, and no more over-provisioning.

But experience has taught many companies that things are not quite that simple. Cloud pricing might appear straightforward. But platform providers’ byzantine pricing changes regularly, while the very ease of consumption can undermine procurement discipline. When hidden extras are worked in, that predictable pricing can translate into monthly sticker shock, especially when hefty egress charges are levied.

Keep the data flowing

Equally importantly, many companies are not in a position to move all, or even some, of their data to the cloud. The reasons range from concerns about data gravity and latency to exposing IP, to unavoidable data sovereignty or privacy restrictions. So, there has been no shortage of offerings aiming to reproduce the cloud experience for on-prem, or hybrid environments. To offer storage “as a service” so to speak. But there is another catch.

Often, the focus is on the consumption part of the equation. On a purely hardware basis, this typically includes the ability to unlock extra capacity when needed, utilising built-in buffers. Beyond the box, these solutions are offered on a consumption-based pricing model, effectively making them opex rather than capex buys. This goes some way to providing more predictable costs, as well as providing other financial benefits – and partially addresses the supply chain conundrum.

But looking back at ESG’s research it’s worth remembering that most IT and business professionals hanker for the experience of “mature cloud operations”. That means much more than being able to buy on-prem capacity more easily.

“What we’re hearing loud and clear is that customers love the operational simplicity of the public cloud,” says Watkins. But how they pay for it is just part of the appeal. What they really want to bring on-prem is “the automation, the flexibility, the agility.”

In other words the novelty of being able to unlock additional capacity quickly wears off if IT admins must still use multiple management tools to provision and manage different classes of storage.

Likewise, tech managers need to know that this “storage as a service” can run any workload without compromise, whether that workload is a legacy application that can’t be lifted and shifted to the cloud, or a modern application that needs to access resources on-prem or in the cloud. Having infrastructure on tap is not a great benefit if it still takes a day and a half to work out which is the best storage for a new workload.

HPE looks to satisfy this far broader requirement for cloud-like experiences with the HPE GreenLake platform, which offers a wide portfolio of on-premises cloud services – spanning compute, networking and storage as well as databases, HPC, containers, AI, VDI and third-party applications. HPE GreenLake for storage features a portfolio of on-premises cloud data services, offered in mission-critical, business-critical or general-purpose options.

With HPE GreenLake for storage, data infrastructure is delivered on-premises, and charged on a consumption basis, which is calculated on the storage actually used after the data has been deduped and compressed. This satisfies those dual needs for control and a more flexible and predictable procurement methodology. 

Abstracted, not distracted

Under HPE GreenLake for storage, the management of the storage infrastructure is abstracted into the cloud. This means that admins get a single console to manage their entire fleet of storage infrastructure, whether distributed across their real estate or in co-lo spaces. This allows them to simply manage, or offload, day-to-day provisioning and management activities such as patching, and set policies and SLAs centrally right across their estate.

Moreover, this is underpinned by AI, in the form of HPE InfoSight. As Watkins explains, this AIOps software for infrastructure offers full stack visibility, right up to the application layer. Maintenance is less a matter of ongoing firefighting, more a question of predictive support automation and continuous optimisation.

The platform also offers policy management and intent-based provisioning, so that deciding where an application should be deployed to meet required service levels is no longer a matter of exhaustive analysis combined with an element of hope and prayer. Rather, admins can specify the workload, the capacity and performance needed, and the key policies, and the service will automatically recommend the best-suited system across the fleet to optimise SLAs.

This is particularly important as tech demands are increasingly driven by line of business imperatives, says Watkins. “They’re the ones that really like the agility of the cloud when it comes to spinning up your infrastructure and they’re the ones who get frustrated with the slowness that they get on-prem. They have to wait too long to have new infrastructure, or get new apps provisioned.”

With HPE GreenLake for storage, he says, this bottleneck is removed. “Whether it’s a modern app or traditional one, with capabilities like intent-based provisioning, you can cut storage provisioning times from days or weeks to minutes, and automatically ensure that workloads are provisioned on the right platform – with no storage expertise required.”

The sum of all this? One easy metric to grasp is increased availability, to the extent that HPE GreenLake for storage offers a 100 per cent availability guarantee on its mission-critical data service with HPE Alletra 9000 systems. More broadly, Forrester Research suggests that HPE GreenLake for storage can help slash time to deliver projects by 75 per cent, a massive advance in a world where customer satisfaction and speed to market are everything. Likewise, Forrester expects a 30 to 40 per cent reduction in total cost of ownership, due to the elimination of overprovisioning.

But arguably the biggest benefit comes with the time savings that a centralised, cloud-like management of your entire storage infrastructure delivers – and potentially, of the other infrastructure and services accessible through HPE GreenLake. To take one example cited by Watkins: “LOB and database admins are happier because they can now move faster … e.g. they can self-provision storage infrastructure without the need for domain expertise.”

Time saved can be devoted to other projects that will deliver on the potential of that data and unleash the digital transformation and innovation that every organisation covets. And if it all means that storage pros’ own blood pressure evens out a little, well surely that’s a good thing too.

Sponsored by HPE

Storage news ticker – 18 February

Boston-based startup ChaosSearch has said it more than doubled its headcount in 2021, and expanded into Europe. ChaosSearch is focused on making it easier for companies to store, access, and analyse vast amounts of data by eliminating upfront prep and turning cloud storage into an analytical lake. The company, headed up by CEO Ed Walsh, plans to expand its platform’s capabilities with the addition of machine learning and cloud support for Azure, and growing its talent pool across engineering, marketing, and sales.

Dell has published a revised ten-page Principled Technologies report, which reveals difficulties in implementing HPE GreenLake compared to APEX. According to the report, “The process of acquiring the APEX Data Storage Services solution was quick and easy, while the HPE GreenLake storage solution acquisition had several delays and issues. We received pricing immediately through the APEX Console, while it took 11 days to receive a price quote for the HPE GreenLake storage solution. In just 13 days from purchase, our APEX Data Storage Services solution was up and running, while the HPE GreenLake storage solution wasn’t ready for operation until 95 days after purchase.”

Druva has commissioned Forrester Research to run a 31-page Total Economic Impact study on the results of using the Druva Data Resiliency Cloud SaaS backup service. No surprise that customers saved a lot of money by using the Druva service. For example: “Upon implementing Druva, interviewees experienced significant reductions in their capex and opex line items, which offered interviewees a range of business value.” Look at the study document here.

Ionir, which sells a cloud-native storage and data management platform for Kubernetes, has become a CircleCI technology partner. CircleCI provides continuous integration and continuous delivery/deployment (CI/CD) software. According to Ionir, the CircleCI partnership means Ionir customers can accelerate CI/CD deployments for cloud-native workflows, with faster data movement, greater capability, less complexity, repeatable and scalable testing, testing with real-life data, and combined storage and data management.

Chinese-American distributed OLAP startup Kyligence has come up with definitions for seven industry buzzwords that should be common this year. They are: Metrics Store/Headless BI, Data Lakehouse, Intelligent Data Cloud, Data Catalog, Data Fabric, Data Mesh, and Data as a Product. Read the definitions here.

Wells Fargo analyst Aaron Rakers has highlighted an Economic Daily report – in Chinese – that notifications from Micron suggest that NAND contract prices have risen by 17–18 per cent and spot prices are up as much as 25 per cent following the chemical contamination of the Kioxia/WD Kitakami NAND manufacturing plant.

Nodeum says it has a system that can be used to import archives from Quantum’s Scalar LTFS Appliance, which is approaching the end of its life.

101 Data Solutions announced a collaboration with Seagate Technology to provide mass-capacity mobile storage and data transfer services with Seagate’s Lyve Mobile edge storage and data transfer service.

DDN subsidiary Tintri announced 42 per cent global revenue growth from the first half of 2021 to the second, including a double-digit revenue increase from net new logos, fuelled by an enhanced global executive sales team and continued VMstore innovation. Tintri has released an update to both TxOS – the software that runs on VMstore appliances – and the Tintri Global Center (TGC) management plane. The payload of the new release breaks down into five categories: bug fixes and security, user interface improvements, additional hardware support, NFS 4.1 beta, and vSphere tags support. Read about it in a blog post.