OpenIO ‘solves’ the problem with object storage hyperscalability

OpenIO, a French startup, has claimed the speed record for serving object data – blasting past one terabit per second to match high-end SAN array performance.

Object storage manages data as objects in a flat address space spread across multiple servers, and its IO speed is inherently slower than that of SAN arrays or filers. However, in August 2019 Minio showed its object storage software can be a fast access store – and now OpenIO has demonstrated that its technology combines high scalability with fast access.


The company created an object storage grid on 350 commodity servers owned by Criteo, an ad tech firm, and achieved 1.372 Tbit/s throughput (171.5 GB/sec). Stuart Pook, a senior site reliability engineer at Criteo, said in a prepared quote: “They were able to achieve a write rate close to the theoretical limits of the hardware we made available.”

OpenIO claimed this performance is a record because to date no other company has demonstrated object storage technology at such high throughput and on such a scale.

To put the performance in context, Dell EMC’s most powerful PowerMax 8000 array does 350 GB/sec. Hitachi Vantara’s high-end VSP 5500 does 148 GB/sec, slower than the OpenIO grid. Infinidat’s InfiniBox does 25 GB/sec.

Laurent Denel, CEO and co-founder of OpenIO, said: “Once we solved the problem of hyper-scalability, it became clear that data would be manipulated more intensively than in the past. This is why we designed an efficient solution, capable of being used as primary storage for video streaming… or to serve increasingly large datasets for big data use cases.”

Benchmark this!

Blocks & Files thought that the workload on the OpenIO grid had to be spread across the servers, as parallelism was surely required to reach this performance level. But how was the workload co-ordinated across the hundreds of servers? What was the storage media?

Blocks & Files: How does OpenIO’s technology enable such high (1.37 Tbit/s) write rates?

OpenIO: Our grid architecture is completely distributed and its components are loosely coupled. There is no central or distributed lock management, so no single component sees all the traffic. This enables the data layer, the metadata layer and the access layer (the S3 gateways) to scale linearly. At the end of the day, performance is all about using the hardware of a server with the least overhead possible and multiplying by the number of servers to achieve the right figure for the targeted performance. That makes it really easy!
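The linear-scaling claim can be sanity-checked with back-of-the-envelope arithmetic on the figures quoted in this article (1.372 Tbit/s across the 352 nodes detailed below). This is illustrative arithmetic only, not OpenIO's sizing methodology:

```python
# Back-of-the-envelope check of the linear-scaling claim, using the
# figures quoted in the interview. Illustrative arithmetic only.

def aggregate_gbit(per_node_gbit: float, nodes: int) -> float:
    """With loosely coupled nodes and no shared lock manager,
    throughput is modelled as simply additive across nodes."""
    return per_node_gbit * nodes

nodes = 352
measured_gbit = 1372                      # 1.372 Tbit/s, as reported
per_node = measured_gbit / nodes          # useful write throughput per node
print(f"{per_node:.1f} Gbit/s per node")  # ~3.9 Gbit/s
print(f"{measured_gbit / 8:.1f} GB/s")    # 171.5 GB/s, matching the article
```

In this additive model, adding nodes (or bandwidth per node) raises the aggregate proportionally, which is what the answer above means by scaling "linearly".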

Blocks & Files: What was the media used?

OpenIO: We were using 3.5-inch LFF magnetic hard drives of 8 TB each. There were 15 drives per server across 352 servers. This is for the data part. The metadata was stored on a single 240 GB SSD in each of the 352 servers. OpenIO typically requires less than 1% of the overall capacity for metadata storage.
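Those figures can be checked against the "less than 1%" metadata claim with simple arithmetic on the numbers quoted above:

```python
# Capacity check from the interview's figures: 352 servers, each with
# 15 x 8 TB HDDs for data and one 240 GB SSD for metadata.
servers = 352
data_tb = servers * 15 * 8          # 42,240 TB of raw HDD capacity
meta_tb = servers * 0.24            # 84.48 TB of SSD for metadata
ratio = meta_tb / data_tb
print(f"{data_tb / 1000:.2f} PB data, {meta_tb:.1f} TB metadata")
print(f"metadata is {ratio:.2%} of raw capacity")  # 0.20%, under the 1% figure
```

At roughly 0.2%, the grid sits comfortably inside the stated 1% metadata budget.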

Blocks & Files: What was the server configuration? 

OpenIO: We used the hardware that Criteo deploys for their production datalake… These servers are meant to be running the Hadoop stack with a mix of storage through HDFS and big data compute. They are standard 2U servers with 10 Gbit/s Ethernet links. Here is the detailed configuration:

  • HPE 2U ProLiant DL380 Gen10
  • CPU: 2 x Intel Skylake Gold 6140
  • RAM: 384 GB (12 x 32 GB SK Hynix DDR4 RDIMM @ 2,666 MHz)
  • System drives: 2 x 240 GB Micron 5100 ECO SSD (only one used for the metadata layer)
  • Capacity drives: 15 x 8 TB Seagate SATA HDD (LFF, hot-swap)
  • Network: 2 x SFP+ 10 Gbit/s HPE 562FLR-SFP + Intel X710 adapter (only one attached and functional)
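Given the single functional 10 Gbit/s link per server listed above, the measured write rate can be set against the grid's raw network ceiling. This is a rough sketch: data-protection traffic between nodes (replicas or erasure-coded chunks, which the article does not detail) shares the same links, so the usable write ceiling is necessarily lower than the raw figure.

```python
# Raw network ceiling of the grid vs. the measured write rate.
# Each of the 352 servers had one functional 10 Gbit/s link (see spec).
nodes, link_gbit = 352, 10
ceiling_tbit = nodes * link_gbit / 1000       # 3.52 Tbit/s raw inbound
measured_tbit = 1.372
fraction = measured_tbit / ceiling_tbit
print(f"ceiling {ceiling_tbit:.2f} Tbit/s, measured {measured_tbit} Tbit/s")
print(f"{fraction:.0%} of the raw NIC ceiling")   # ~39%
```

Landing at roughly 39% of the raw NIC ceiling, while inter-node protection traffic consumes a share of the same links, is consistent with Criteo's remark that the write rate was close to the theoretical limits of the hardware.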

Blocks & Files: Was the write work spread evenly across the servers?

OpenIO: Yes. The write traffic came from Criteo’s production datalake (2,500 servers), i.e. it came from many sources and targeted many destinations within the OpenIO cluster, as each node – each physical server – is an S3 gateway, a metadata node and a data node at the same time. Criteo used a classic DNS round-robin mechanism to route the traffic to the cluster (350 endpoints) as a first level of load balancing.

As this is never perfect, OpenIO implements its own load balancing mechanism as a second level: each OpenIO S3 gateway is able to share its load with the others. This produced a very even write flow, with each gateway taking 1/350 of all the write calls.

Blocks & Files: How were the servers connected to the host system?

OpenIO: There is no host system. It is one platform, the production datalake of Criteo (2,500 nodes), writing data to another platform, the OpenIO S3 platform (352 nodes). The operation was performed with a distributed copy tool from the Hadoop environment. The tool, distCp, can read and write from and to both HDFS and S3.

From a network standpoint, the two platforms are nested together, and the servers from both sides belong to the same fully-meshed fabric. The network is non-limiting, meaning that each node can reach its theoretical 10 Gbit/s bandwidth talking to the other nodes.

Blocks & Files: How many hosts were in that system and did the writing?

OpenIO: It was 352 nodes running the OpenIO software, with every node serving a part of the data, metadata and gateway operations. All were involved in the writing process… the nicest part is that with more nodes, or more network bandwidth and more drives on each node, the achievable bandwidth would have been higher, in a linear way. As we are more and more focused on providing the best big data storage platform, we believe that the performance and scalability of OpenIO’s design will put us ahead of our league.