Case Study. Quobyte unified block, file and object storage is being used by the HudsonAlpha Institute for Biotechnology to store primary life sciences and genomics data in a hybrid disk+flash system designed to provide both high capacity and high performance.
Investigating how DNA shapes people, plants and animals is proving a useful way to understand diseases and their treatment, improve crop yields, and raise our overall quality of life. The work demands intricate and complex research, and its large, fast-growing data storage needs are testing the IT systems of research organizations, as HudsonAlpha found out.
Nonprofit HudsonAlpha was set up in 2008 on a 152-acre campus in Huntsville, Alabama. Its 200+ employees conduct genomics-based research aimed at improving human health and well-being, and more than 50 biotech companies are based on its campus and use its services. The institute says it is using the power of DNA to help solve some of society’s most pressing challenges in human health and plant science.
Annual genomics data volumes are rising rapidly. According to Statista, the genomics field created 1 petabyte (PB) in the five years from 2009 to 2014, but went on to generate 20PB in the next five years.
HudsonAlpha’s IT setup consists of two main storage systems, each with a capacity of approximately 4.5PB. One of these is used for production and “hot” projects, and now runs on a clustered Quobyte software system. The other is used for archiving and runs on a separate object storage appliance. Because of its large data volumes, HudsonAlpha does little with cloud-based storage, save for what it terms “very cold storage.”
HudsonAlpha system architect Richard Johnson told Blocks & Files: “Our production cluster is made up of 26 storage nodes each with 22 x 16TB hard disk drives, 2 x 8TB SSDs, and a 1.6TB NVMe drive, and connected to the network with dual 100GbE ports.
“We are using NVMe for metadata and using SSD to accelerate performance with small, mostly read-intensive files. In the genomic space, there is a mix between very large highly compressed data files and very small text files. Solid state drives work better for the small files but are way too expensive to handle the capacity required by the larger ones.”
Legacy parallel filesystem
HudsonAlpha decided it needed a new primary data storage system because its existing parallel filesystem software was becoming too difficult to manage and too limiting. Johnson said: “We moved away from a legacy parallel file system because of the complexity and level of effort to operate and maintain along with limited hardware scaling options.”
Another legacy system issue was its rudimentary on-screen reporting, which offered limited ability to monitor data conditions and manage them in real time. The genomics group constantly struggled with the lack of a direct interface to its legacy parallel filesystem, wanting more immediate awareness of, and control over, its data.
What legacy file system was this? Neither Quobyte nor HudsonAlpha would say. We note from an August 2018 Scientific Computing World article that “Data In Science Technologies (DST) has been providing HPC compute and GPFS cluster support to HudsonAlpha for two years, to help the biotech institute maximise cluster availability, performance, and configuration consistency.” That indicates Quobyte is being used in preference to Spectrum Scale.
HudsonAlpha had a system scalability issue as well. It needed the ability to add another storage server to a cluster and have that capacity immediately available.
The appliance-based platforms it evaluated had limits on the number of enclosures, the number of drives per enclosure, where certain drive types had to reside, and the mix of media the platform could tolerate. Users wanting to scale beyond the supported number of enclosures faced a forklift upgrade requiring another large infrastructure and licensing investment. This made it infeasible to add resources on the scale of a few hundred terabytes; growth had to come in petabyte increments.
HudsonAlpha embarked on proof-of-concept trials with a number of vendors. The institute looked at both hybrid disk/flash and all-flash systems, with Johnson saying: “We looked at all-flash as a play to get as much performance as possible for our cluster, but in the end all of the flash vendors were not economical for the capacities we require.”
They became unaffordable when scaling to petabyte-class capacities. A less obvious problem was that every flash-based system Johnson examined was based on NFS. Even though the individual drives might offer exceptional performance, he said the throughput constraints introduced by the filesystem bottlenecked all single-stream processing. Ultimately, that was even more of a problem than price.
This testing resulted in Quobyte’s software being selected because it “offers us a great deal of flexibility for hardware configurations allowing us to scale capacity and performance in a variety of ways.” The software is flexible: admins can specify which file types should and should not be placed on SSD media, based on the workload and other factors.
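To illustrate the kind of file-placement policy described above, here is a minimal sketch in Python. This is a hypothetical rule, not Quobyte’s actual configuration syntax: the extension set and size cutoff are assumed values, standing in for the idea of routing small, read-intensive text files to SSD while large compressed genomics files stay on HDD.

```python
from pathlib import Path

# Hypothetical tiering policy: small text-like files go to SSD,
# large compressed genomics files stay on HDD. The extensions and
# threshold below are illustrative assumptions, not Quobyte's rules.
SSD_EXTENSIONS = {".txt", ".csv", ".json", ".bed"}
SSD_MAX_BYTES = 64 * 1024 * 1024  # 64 MiB cutoff (assumed)

def choose_tier(name: str, size_bytes: int) -> str:
    """Return the media tier ("ssd" or "hdd") a file should land on."""
    if size_bytes <= SSD_MAX_BYTES and Path(name).suffix in SSD_EXTENSIONS:
        return "ssd"
    return "hdd"

# Small annotation file vs. a large compressed sequencing file
print(choose_tier("sample_variants.csv", 2 * 1024 * 1024))
print(choose_tier("reads.fastq.gz", 300 * 1024**3))
```

The key design point, matching Johnson’s description, is that the decision depends on both file type and size, so capacity-hungry files never consume expensive flash.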
Quobyte is also hardware-agnostic, approaching storage from the standpoint of: “Show some more storage and we’ll give it to you… If we had a new project come online and we needed another petabyte of storage, we could literally order a couple servers, put them online in a matter of minutes, and have that capacity available to the Quobyte file system. If we were dealing with storage appliances, those abilities would be a lot more constrained.”
The system provides admin staff with direct, API-accessed management information. Johnson said: “We’ve been able to structure the volumes in a way that I can get project-level capacity information at a glance. I can get user-level capacity information. I have real-time access to how big people’s home directories are, which we heavily use. We have scripts that run all the time to gather information and help us plot it. We use it for chargeback reports and showback reports and all kinds of things. It’s incredibly powerful.”
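The capacity-gathering scripts Johnson describes could look something like the following sketch. It does not use Quobyte’s API (whose endpoints are not documented in this article); instead it walks the filesystem directly with the Python standard library, summing usage per top-level directory, the per-project and per-user figures that feed showback and chargeback reports.

```python
import os
from pathlib import Path

def directory_usage(root: str) -> dict:
    """Sum file sizes per top-level subdirectory of `root`.

    Returns a mapping like {"projA": 123456, ...}, the kind of
    per-project capacity figure used in showback/chargeback reports.
    """
    usage = {}
    for entry in Path(root).iterdir():
        if entry.is_dir():
            total = 0
            # Walk the whole project tree and accumulate file sizes.
            for dirpath, _dirnames, filenames in os.walk(entry):
                for fname in filenames:
                    fpath = os.path.join(dirpath, fname)
                    if os.path.isfile(fpath):
                        total += os.path.getsize(fpath)
            usage[entry.name] = total
    return usage
```

In practice such a script would run on a schedule and append its results to a time series for plotting, as Johnson describes; querying the storage system’s API directly would avoid the cost of walking very large trees.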
Quobyte has found that its software DNA resonates with HudsonAlpha and can cope with the torrent of genomics data being generated by its researchers.