Solarflare demos NVMe over TCP at warp speed

Solar flare. Source: “Massive X-Class Solar Flare”, uploaded by PD Tillman; author: NASA Goddard Space Flight Center

Who needs RoCE or iWARP if NVMe over TCP is this fast?

Dell has validated NVMe over TCP technology from Solarflare, the maker of superfast Ethernet NICs, as being within 3 to 4 microseconds of the latency of NVMe over RoCE.

That is a much smaller gap than in Pavilion Data’s demonstration, which showed NVMe over TCP being about 75µs slower than NVMe over RoCE’s 107µs latency.

A 3 to 4µs latency difference between NVMe TCP and NVMe RoCE is immaterial.

NVMe over TCP uses ordinary Ethernet, not the more expensive lossless Data Centre Bridging (DCB) class required by RDMA-based NVMe RoCE. 
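To illustrate how little special plumbing NVMe over TCP needs on the host side, here is a hedged sketch using the standard Linux nvme-cli tool; the target address and subsystem NQN are hypothetical, and the commands assume a recent kernel with the nvme-tcp module.

```shell
# Load the NVMe/TCP host driver (shipped with recent Linux kernels)
modprobe nvme-tcp

# Discover subsystems exposed by a target at 10.0.0.10 (hypothetical
# address), port 4420 (the IANA-assigned NVMe-oF port)
nvme discover -t tcp -a 10.0.0.10 -s 4420

# Connect to a discovered subsystem by its NQN (hypothetical name);
# its namespaces then appear as ordinary /dev/nvmeXnY block devices
nvme connect -t tcp -a 10.0.0.10 -s 4420 -n nqn.2018-08.example.com:subsys1
```

Note there is no RDMA NIC requirement and no DCB switch configuration in that flow; any IP-reachable path will do.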

Ahmet Houssein, Solarflare’s VP for marketing and strategic development, said the lossy nature of ordinary Ethernet is mostly due to congestion. If that is controlled better, the loss rate falls dramatically: “Our version of ordinary Ethernet is nearly lossless. … If you take [lossy Ethernet] away, why do you need RDMA and DCB extensions for Ethernet?”

“Dell says we’re almost within 3 per cent of RoCE.”

Possibly the Pavilion demo used an early version of Lightbits technology.

Houssein said Solarflare’s kernel bypass technology, which runs TCP in user space instead of switching into kernel space, is not proprietary; the POSIX-compliant Onload stack is available to anybody and needs no application rewriting.

Solarflare says Onload delivers half-round-trip latency in the 1,000 nanosecond range. In contrast, the typical kernel stack latency is about 7,000 nanoseconds.
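The transparency Houssein describes can be sketched as follows: with OpenOnload installed on a host with a Solarflare NIC, an unmodified, dynamically linked sockets application is accelerated simply by launching it under the onload wrapper (the application name and port here are hypothetical).

```shell
# Run an unmodified sockets application under Onload; the library
# intercepts BSD sockets calls and services them in user space,
# bypassing the kernel network stack
onload ./my_tcp_server --port 9000

# Equivalent explicit form: preload the Onload library directly
LD_PRELOAD=libonload.so ./my_tcp_server --port 9000
```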

Solarflare NIC and kernel bypass


Solarflare’s TCPDirect API builds on Onload by providing an interface to an implementation of TCP and UDP over IP. TCPDirect is dynamically linked into the address space of user-mode applications and granted direct (but safe) access to Solarflare’s XtremeScale X1 hardware.

Solarflare says that, under very specific circumstances with ideal hardware, TCPDirect can reduce latency from 1,000 nanoseconds to 20-30 nanoseconds. According to a Google-cached version of the TCPDirect user manual: “In order to achieve this [extreme low latency], TCPDirect supports a reduced feature set and uses a proprietary API.”

Note: “To use TCPDirect, you must have access to the source code for your application, and the toolchain required to build it. You must then replace the existing calls for network access with appropriate calls from the TCPDirect API. Typically this involves replacing calls to the BSD sockets API. Finally you must recompile your application, linking in the TCPDirect library.”

In contrast: “Onload supports all of the standard BSD sockets API, meaning that no modifications are required to POSIX-compliant socket-based applications being accelerated. Like TCPDirect, Onload uses kernel bypass for applications over TCP/IP and UDP/IP protocols.”

Pavilion Data

Pavilion Data’s head of product, Jeff Sosa, commenting on the Solarflare demo, said: “As far as direct comparisons, it probably doesn’t make sense to compare results since we were running real-world workloads at higher-QD and examining average latency measured from the host generating the IO, not a single-QD IO to try to find the lowest latency number possible.

“Also, our customers’ results were using a similar methodology, over multiple subnets/switch hops in cases. In addition, Pavilion delivers full data management capabilities in the array used to produce the results, including RAID6, snapshots, and thin provisioning (it’s not a JBOF).

“Even with all that, on a system workload of 2 million IOPS, we showed that the average latency of RDMA and TCP were fairly close, and thus the driving decision factor will be cost savings through the avoidance of specialized hardware and software stacks for many users.”

How does Pavilion view NVMe over TCP and NVMe over RoCE?

“Pavilion’s position is that we will support NVMe-oF over both ROCE and TCP transports from the same platform, and don’t favor one over the other. Instead, we let our customers decide what they want to use to best meet their business requirements. The average latency results we presented at FMS were measured end-to-end (from each host) using a steady test workload of 2 million IOPS on the Pavilion Array, which was leveraging RAID6, thin provisioning, and snapshots.

“However, when we measure only a single-IO, the latency of ROCE and TCP are within a few usec, but this is not a scenario our customers care about typically.”

Sosa emphasised that Pavilion is very interested in seeing momentum around NVMe-oF with TCP grow, and believes it has a big future. Pavilion looks forward to working with the broader vendor community to optimize NVMe over TCP even further as the standard gets ratified and the open host protocol driver makes its way into OS distributions, which should drive even wider adoption and lower cost for customers deploying NVMe-oF-based shared storage.

NVMe-oF supplier reactions

It is our understanding, from talking to Solarflare, that all existing NVMe-over-Fabrics suppliers and startups are adding NVMe TCP to their development roadmaps.

Houssein says that, with NVMe TCP, you don’t need all-flash arrays, merely servers talking NVMe TCP to flash JBODs. You don’t have to move compute to storage with this because the network pipe is so fast.

Houssein said Pure Storage will support NVMe TCP in 2019, and he thinks NVMe-oF array startups will move their value up the stack into software.

The prospect offered is that NVMe-oF is an interim or starting phase in the transition of all-flash arrays from Fibre Channel (FC) and iSCSI access to either NVMe over FC or NVMe over TCP. The NVMe TCP standard will be ratified in a few months, and then server and all-flash system suppliers will adopt it, as it provides the easiest route for current Ethernet users to upgrade to NVMe-oF storage access speed.

Ditto NVMe FC and FC SAN users.

We might expect customers to start adopting NVMe FC and NVMe TCP from 2019 onwards, once their main system suppliers have their NVMe TCP product ducks lined up in a row.