WekaIO has been selected for a genome sequencing organisation’s new project. A temporary Isilon performance problem on its existing project, caused by near 100 per cent capacity usage, has been resolved.
In 2015 Genomics England deployed 7PB of generation 5 Isilon clustered NAS flash and disk drive storage from EMC for its 100,000 Genomes project. That project is being followed by a separate five million genomes project, for which WekaIO’s filesystem software has been selected.
David Ardley, director of technology at Genomics England, said: “Our legacy storage system had already reached its limit and performance had deteriorated. We needed a modern storage solution that could scale to hundreds of petabytes while maintaining performance scaling, and it had to be simple to manage at that scale.”
However, our understanding from talking to people close to Dell and the project is that the deployed Isilon system became virtually full, running at more than 95 per cent of its configured capacity. Following Dell EMC recommendations, its capacity was increased and performance returned to the desired level.
In its latest software iteration the Isilon OneFS file system can support four times larger files, up to 16TB in size. Gen 6 Isilon systems also scale out further and perform faster than earlier gen 5 systems.
100,000 Genomes Project
Originally, Genomics England (GEL) wanted to sequence 100,000 genomes from 70,000 people, including NHS patients and their families. Its goal is to provide better disease response by optimising medication for genomes – a person’s DNA structure – and identifying patients at risk from diseases linked to their genome types.
GEL works on large files and looks for common patterns. It requires parallelised access to a library of files, up to 240GB in size, held in network-attached storage (NAS). In 2015, Isilon provided the best kit for this task, with DRAM and flash for metadata, and bulk sequenced genome data stored on SATA disk drives.
Backup services were provided by Dell EMC’s Data Domain and NetWorker. In September 2016 GEL decided to additionally use an Isilon data lake to store all the data collected during the sequencing process so that it could be analysed.
The data lake was then sized at 17PB. GEL also bought 24 all-flash XtremIO X-Bricks to provide faster block storage for its applications. At that point it had sequenced 13,040 genomes.
GEL completed its 100,000th sequence in December 2018 and has amassed the world’s largest database of whole genome sequences with associated clinical data. The Isilon genome sequencing storage system and data lake are still operational.
Pilot 20,000 baby genomes project
There is a pilot GEL project in which 20,000 babies will be given whole-genome sequencing to detect their liability for epilepsy, cystic fibrosis and other conditions. NHS England operates the national NHS Genomic Medicine Service (GMS) and intends to integrate genomic medicine with routine NHS care by 2025.
The NHS GMS will be deployed across England from April 2020 and comprises seven networked genomic laboratory hubs in an NHS genomic medicines centre infrastructure. A national genomic test directory and whole-genome sequencing will be available nationwide with an integrated clinical service.
In comes WekaIO
NHS England has now decided to sequence five million whole genomes by 2024. That means a genome library in the hundreds of petabytes. GEL has decided that WekaIO’s file system is the one to use.
A linear projection from 100,000 genome sequences at 25PB to five million sequences entails 1,250PB of data lake storage.
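The arithmetic behind that projection can be checked with a short Python sketch. It assumes, as the projection does, that storage grows strictly in proportion to the number of genomes, and it uses decimal units (1PB = 1,000,000GB). The function names are ours, for illustration only.

```python
GB_PER_PB = 1_000_000  # decimal units: 1PB = 1,000,000GB (an assumption)


def projected_storage_pb(current_genomes, current_pb, target_genomes):
    """Scale current storage linearly to a target genome count."""
    return current_pb * target_genomes / current_genomes


def storage_per_genome_gb(genomes, total_pb):
    """Average data lake footprint per genome, in GB."""
    return total_pb * GB_PER_PB / genomes


if __name__ == "__main__":
    # Figures from the article: 100,000 genomes at 25PB, target five million.
    print(projected_storage_pb(100_000, 25, 5_000_000), "PB projected")
    print(storage_per_genome_gb(100_000, 25), "GB per genome on average")
```

On these figures the average works out at 250GB per genome. A real deployment would probably not scale this neatly, since compression, retention policy and changes in sequencing output all affect the total.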
Ardley said he likes WekaIO’s combination of flash for performance and object store for scale, with data tiered from disk to flash.
Blocks & Files points out that Isilon systems can be similarly configured, and can tier data to the public cloud if desired.
WekaIO CEO Liran Zvibel said: “The Weka File System has delivered a 10x performance improvement over GEL’s legacy NFS-based NAS and is enabling more effective use of existing cloud infrastructure. [This will] improve overall productivity and empower researchers to become more efficient at analysing results.”
Blocks & Files’ understanding is that a clustered Isilon system can scale up and out to that level, to exabyte scale. In fact, it has customers storing well past 100PB on its systems already, with newer hardware outperforming the fifth generation.
Such exabyte-class storage systems are good news for genome sequencing.
Genomics England Chief Scientist Professor Mark Caulfield said: “As the UK database expands to five million sequences and beyond, new insights will help to save many lives, both in the NHS and around the world.”