Object storage is ill-suited to AI, ML, and HPC workloads because it cannot pump data fast enough to the many processors involved. File systems can, but cloud file storage is expensive compared with S3 object storage.
Paradigm4 CTO Gary Planthaber has been deeply involved in bridging this divide by creating the flexFS file system, which turbocharges object storage with a POSIX-compliant, high-throughput file access front end. He joined Paradigm4, a provider of large-scale dataset management software for life sciences customers, in 2018. That software, which includes the SciDB database, has distributed computing and parallel processing capabilities.
Planthaber has written a Medium article about the how-and-why of flexFS’s foundation. He explains that Paradigm4 wanted to help pharma companies develop drugs by examining massive datasets of genomic, medical history, and diagnostic scan data for 500,000 or more patients, as well as population-scale biobanks. It wanted to trace links between genetic elements and ill health using phenotypes (observable traits of an individual determined by their genotype and environment) and genotypes (genetic makeup).
This analytical processing work involves “large clusters comprising hundreds of powerful servers working together to provide researchers with answers within a reasonable period of time.” Typically this work is done in the public cloud so that the necessary servers can be spun up when needed and shut down when the job is done, saving server provisioning and operating costs.
But there is so much data that each compute node only has access to portions of it, which are streamed from the storage source rather than held on direct-attached drives; the files being streamed can be hundreds or thousands of gigabytes in size. Planthaber explained that the core requirement was “to maximize cluster-wide throughput”, with latency a secondary concern given how much data is being read. He needed a rate of around 1GB/sec per cluster node – 500GB/sec aggregate for a 500-server cluster.
Amazon’s EBS was not suitable as: “It turns out that the sustained throughput of EBS for large files is actually not very high.” Amazon EFS also did not match the need as “the EFS limited pool of throughput gets exhausted easily when concurrently transferring gigabytes of data to hundreds of consumers.” EFS provisioned throughput is expensive and “when hundreds of servers are performing sustained reads of several gigabytes of data concurrently, we witnessed a complete meltdown. Unfortunately, that happened to be our primary use-case, so we ultimately had to abandon EFS.”
Amazon FSx for Lustre was prohibitively expensive as: “we would have to provision far more storage capacity than we needed in order to get enough aggregate throughput to service a cluster of hundreds of servers with anywhere near 1GB/sec throughput each.”
Amazon S3 could meet Paradigm4’s cost and throughput goals, but is not a file system and “the tools that we need to run on our analysis servers expect to see files as inputs.”
Planthaber said he realized that “What we really wanted was a proper POSIX network file system with the pricing, durability, and aggregate throughput of S3”, with file data stored in physical blocks to mitigate S3 API charges and latency. There were several open-source S3-backed file systems, such as s3backer, but they were all unsuitable. For example, “we couldn’t use s3backer to read data and also write results at the same time on multiple servers.”
So “we decided to build our own commercial S3-backed file system” using Paradigm4’s own software engineers – the result is flexFS.
Planthaber has written an Introduction to flexFS article in which he notes it features POSIX compliance and Linux advisory file locking. He mentions: “We also went to great lengths to ensure that XATTRs and extended ACLs are fully supported by flexFS.”
To counter S3’s high latency, he writes: “We decided to separate file system metadata handling into a lightweight, purpose-built metadata service with low latency and ACID compliance. This adds a small amount of infrastructure cost but pays dividends through much faster metadata performance and safety.”
Also, “By assigning persistent IDs to every file inode in flexFS and only building S3 keys using these inode IDs rather than dentries, we can perform file system path operations (mv, cp, ln, etc.) entirely within our low-latency metadata service without touching S3 at all.”
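As a rough sketch of that idea – not Paradigm4’s actual code or schema – path-to-inode mappings can live in a small ACID store (SQLite here), while S3 keys are derived only from persistent inode IDs, so a rename or move never touches S3. The table layout and key format below are illustrative assumptions:

```python
# Hypothetical sketch: dentries (paths) live in a low-latency ACID metadata
# store, while S3 object keys are built from stable inode IDs, never from paths.
# Names, schema, and key layout are illustrative, not flexFS's real design.
import sqlite3

db = sqlite3.connect("metadata.db", isolation_level=None)  # autocommit; SQLite gives ACID semantics
db.execute("CREATE TABLE IF NOT EXISTS dentries (path TEXT PRIMARY KEY, inode_id INTEGER)")

def s3_key(inode_id: int, block_index: int) -> str:
    """Build an S3 object key from a persistent inode ID and a block index."""
    return f"blocks/{inode_id}/{block_index:012d}"

def rename(old_path: str, new_path: str) -> None:
    """mv is a pure metadata operation: the inode ID, and so every S3 key, is unchanged."""
    db.execute("UPDATE dentries SET path = ? WHERE path = ?", (new_path, old_path))
```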
File data partitioning is dealt with like this: “The address space of all file data is partitioned into fixed-sized blocks (i.e., the file system block size), which map to deterministic indices. … Having file data partitioned into fixed-sized address blocks allows flexFS to support highly parallel operations on file data and contributes to its high-throughput characteristics. It also allows flexFS to store extremely large files that go well beyond the 5TB size limit S3 imposes on objects.”
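A minimal illustration of that block addressing, assuming a hypothetical 4 MiB block size (flexFS’s actual block size and fetch logic are not specified here): a byte range maps deterministically onto block indices, and those blocks can then be fetched from S3 in parallel:

```python
# Illustrative only: deterministic offset-to-block mapping plus a parallel
# fetch of the covering blocks. BLOCK_SIZE is an assumed value.
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical 4 MiB file system block size

def block_range(offset: int, length: int) -> range:
    """Map a byte range (length > 0) onto the block indices that cover it."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return range(first, last + 1)

def read_range(fetch_block, inode_id: int, offset: int, length: int) -> bytes:
    """Fetch every covering block concurrently, then slice out the requested bytes."""
    indices = list(block_range(offset, length))
    with ThreadPoolExecutor(max_workers=len(indices)) as pool:
        blocks = list(pool.map(lambda i: fetch_block(inode_id, i), indices))
    data = b"".join(blocks)
    start = offset - indices[0] * BLOCK_SIZE
    return data[start:start + length]
```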
Compression is also used to cut down storage space and network time. “By default, flexFS compresses all file data blocks using low-latency LZ4 compression. Though this feature can be disabled, it nearly always yields performance and storage cost benefits.
“We have, in practice, seen dramatic reductions in the storage space needed by S3 file data objects, even when source files had already been compressed. For example, we can see a real-world HDF5 file with 1.6TB of data has been reduced to 672GB of data stored in S3.”
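As a loose illustration of that compression step – using the open-source python-lz4 bindings rather than anything from flexFS itself – each block can be LZ4-compressed before upload and stored raw if it turns out to be incompressible:

```python
# Illustrative only: LZ4-compress a data block before storing it in S3 and
# record whether compression actually helped. Uses the python-lz4 package.
import lz4.frame

def pack_block(raw: bytes) -> tuple[bytes, bool]:
    """Return the bytes to store and a flag saying whether they are LZ4-compressed."""
    compressed = lz4.frame.compress(raw)
    if len(compressed) < len(raw):
        return compressed, True   # smaller object: less storage and network time
    return raw, False             # incompressible block: store as-is

def unpack_block(stored: bytes, is_compressed: bool) -> bytes:
    """Reverse pack_block on read."""
    return lz4.frame.decompress(stored) if is_compressed else stored
```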
Both file system data and metadata are encrypted.
All in all, “We built flexFS to be a high-throughput, cloud-native network file system that excels at concurrently streaming large files to large clusters of servers.” The software was launched in 2023.
It works and is successful, Planthaber observed: “flexFS is being used in production by multiple top-tier biotech and pharmaceutical organizations to analyze large volumes of mission-critical data.” Other customers operate in the climate science and insurance risk modeling areas.
Because it has a pluggable backend architecture, flexFS can also be used in the Azure and Google public clouds, among others.
For more details contact Paradigm4 at flexfs@paradigm4.com.