NVMe/TCP needs good TCP network design

Poorly designed NVMe/TCP networks can get clogged with NVMe traffic and fail to provide the low latency that NVMe/TCP is supposed to deliver in the first place.

The SNIA has an hour-long presentation explaining how NVMe/TCP storage networking works, how it relates to other NVMe-oF technologies, and potential problem areas.

NVMe over TCP is interesting because it makes the fast NVMe fabric available over an Ethernet network without requiring the lossless, data centre-class Ethernet components needed to carry RDMA over Converged Ethernet (RoCE) transmissions.

Such Ethernet components are more expensive than traditional Ethernet. NVMe/TCP uses ordinary, lossy Ethernet and so offers an easier, more practical way to advance into faster storage networking than either Fibre Channel or iSCSI.

The webcast presenters are Sagi Grimberg from Lightbits, J Metz from Cisco, and Tom Reu from Chelsio, and the talk is vendor-neutral.

This talk makes clear some interesting gotchas with TCP and NVMe. First of all, every NVMe queue is mapped to its own TCP connection. There can be 64,000 such queues, and each one can hold up to 64,000 commands. That means there could be up to 64,000 additional TCP connections hitting your existing TCP network if you add NVMe/TCP to it.
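The connection-count arithmetic is simple but worth making concrete. A rough sketch, where the host, subsystem and queue counts are illustrative assumptions rather than figures from the webcast:

```python
# Sketch: estimate the extra TCP connections NVMe/TCP adds to a network.
# Each NVMe I/O queue is carried on its own TCP connection, so the total
# is hosts x subsystems x queues per association. All figures below are
# illustrative assumptions, not values from the SNIA webcast.

def extra_tcp_connections(hosts: int, subsystems: int, queues: int) -> int:
    """Each NVMe queue pair maps to a dedicated TCP connection."""
    return hosts * subsystems * queues

# e.g. 50 hosts, 4 storage subsystems, 8 I/O queues per association
# (a common choice is one queue per CPU core).
print(extra_tcp_connections(50, 4, 8))  # -> 1600
```

Even modest deployments multiply out quickly, which is why the existing switch and NIC capacity needs checking before migration.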

If you currently use iSCSI, over Ethernet of course, and move to NVMe/TCP using the same Ethernet cabling and switching, you could find that the existing Ethernet is not up to the task of carrying the extra connections.
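For context, bringing up an NVMe/TCP association from a Linux host uses the nvme-cli tool. A sketch of the command, where the IP address and subsystem NQN are hypothetical placeholders:

```shell
# Sketch: connect a Linux host to an NVMe/TCP target with nvme-cli.
# The address and NQN below are hypothetical placeholders; 4420 is the
# IANA-assigned port commonly used for NVMe over Fabrics.
nvme connect -t tcp -a 192.0.2.10 -s 4420 \
  -n nqn.2019-08.com.example:nvme:subsystem1
```

Each such association then opens one TCP connection per NVMe I/O queue behind the scenes.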

Potential NVMe/TCP problems

NVMe/TCP has more potential problem areas: latency higher than RDMA, head-of-line blocking adding latency, Incast adding latency, and a lack of hardware acceleration.

RDMA is the NVMe-oF gold standard, and NVMe/TCP could add a few microseconds of extra latency to it. But, compared to the larger iSCSI latency, the extra few microseconds are irrelevant and won't be noticed by those migrating from iSCSI.

The added latency might be noticed by some latency-sensitive workloads, which wouldn’t have been using iSCSI in the first place, and for which NVMe/TCP might not be suitable.

Head-of-line blocking can occur in a connection when a large transfer holds up smaller ones while it waits to complete. This can happen even when the protocol breaks large transfers up into a group of smaller ones. Network admins can institute separate read and write queues so that, for example, small reads are not stuck waiting behind large writes. NVMe also provides a priority scheme for queue arbitration which can be used to mitigate any problem here.
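The benefit of separate queues can be seen in a toy model. A minimal sketch, with transfer "sizes" in arbitrary time units chosen purely for illustration:

```python
# Sketch: why separate queues help with head-of-line blocking.
# In a single FIFO queue, small transfers complete only after every
# transfer ahead of them; splitting traffic across queues lets the
# small transfers finish sooner. Sizes are illustrative time units.

def completion_times(queue):
    """Serve a FIFO queue; each item finishes after all items ahead of it."""
    times, elapsed = [], 0
    for size in queue:
        elapsed += size
        times.append(elapsed)
    return times

large_write, small_reads = 100, [1, 1, 1]

# Shared queue: the small reads wait behind the large write.
print(completion_times([large_write] + small_reads))  # -> [100, 101, 102, 103]

# Separate read queue: the reads no longer queue behind the write.
print(completion_times(small_reads))                  # -> [1, 2, 3]
```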


Think of Incast as the opposite of broadcast: many synchronised transmissions converge on a single point, forming a congestion bottleneck. The receiving buffer overflows, the affected packets are dropped, the sessions back off, and the resulting retransmissions add latency.
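The mechanism reduces to simple arithmetic. A sketch, using made-up burst and buffer sizes for illustration:

```python
# Sketch: Incast in miniature. N synchronised senders each burst B bytes
# into a switch port whose buffer holds S bytes; anything beyond S is
# dropped and must be retransmitted. All numbers are illustrative.

def incast_drops(senders: int, burst_bytes: int, buffer_bytes: int) -> int:
    """Bytes dropped when synchronised bursts exceed the port buffer."""
    return max(0, senders * burst_bytes - buffer_bytes)

# 32 senders x 64 KB bursts into a 1 MB buffer: 1 MB of overflow dropped.
print(incast_drops(32, 64 * 1024, 1024 * 1024))  # -> 1048576
```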

It could be a problem, and might be fixed by switch and NIC (Network Interface Card) vendors upgrading their products, and possibly by TCP developers with technologies like Data Centre TCP (DCTCP). The idea would be to tell the sender, by explicit congestion detection and notification, to slow down before the buffer overflow happens. The slowing itself would add latency, but not as much as an Incast buffer overflow. Watch this space.
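On Linux, for example, DCTCP already ships as a pluggable congestion-control module. A sketch of enabling it, assuming the switches in the path are configured to mark packets with ECN:

```shell
# Sketch: enable ECN and switch TCP congestion control to DCTCP (Linux).
# DCTCP only helps if the switches on the path mark packets with ECN;
# treat these settings as an illustration, not a tuning recommendation.
sysctl -w net.ipv4.tcp_ecn=1
sysctl -w net.ipv4.tcp_congestion_control=dctcp
```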

Hardware-accelerated offload devices could reduce NVMe/TCP latency below that of software NVMe/TCP transmissions. Suppliers like Chelsio and others could introduce NVMe/TOEs: NVMe TCP Offload Engine cards, complementing existing TCP Offload Engine cards.

The takeaway here is that networks should be designed to carry the NVMe/TCP traffic, and that needs a good estimate of the added network load from NVMe.

This SNIA webcast goes into this in more detail and is well worth watching by storage networking and general networking people considering NVMe/TCP.