NVMe v1.4 resolves data centre SSD noisy neighbour problems

At the risk of ruining the ending … Data centre SSDs have their own set of stubborn problems, and those problems are not remotely like the ones client SSDs face. Remedies do not have to include re-inventing the SSD: over-engineering and NVMe v1.4 fix the problems.

How noisy neighbour and long tail latency problems happen 

Latency explained – think of network wire speed as a fast motorway with sparse traffic moving at the speed limit. Speed limits are analogous to maximum network wire speeds. Latency is bad because it slows effective network wire speed.

Think of latencies as stoplights on that motorway. Stoplights cause traffic to pile up, increase travel times and diminish the total amount of traffic handled. Stoplights are analogous to data centre system latencies.

Noisy neighbours cause latency problems. 

Latency problems are uniquely important in data centre and similar multi-user situations. How and why?

SSD noisy neighbour problems arise when several concurrent flash writes or multiple container workloads compete for the same SSD resources. This causes increased latency.

Noisy neighbour problems are increasing due to three industry trends: larger capacity SSDs, SSDs with reduced write performance, and SSDs with lower endurance. I’ll explain each. 

  • Larger capacity SSDs store more data and therefore serve more simultaneous I/O. More I/O to a single device increases the probability of noisy neighbour problems.
  • Reduced write performance SSDs: with each generation of NAND, from SLC to MLC to TLC to QLC, writes are slower. Slower writes stay in flight longer, which increases the probability of simultaneous writes and noisy neighbours.
  • Lower endurance SSDs: with each generation of NAND, from SLC to MLC to TLC to QLC, write endurance diminishes. This increases garbage collection and error correction activity, which in turn increases the probability of conflicting writes and noisy neighbours.

Why this matters … Latency-increasing noisy neighbours become costly in high-value, time-sensitive data centre workloads that carry a time-based response SLA, such as credit card fraud analytics.

Additionally, noisy neighbours have a high impact in many clustered database applications, where a query completes only after the slowest SSD responds.
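To put a rough number on the "slowest SSD" effect, here is a minimal sketch. It assumes each SSD responds slowly with the same probability and independently of the others, which real devices will not strictly do. The function name and the 1% figure are illustrative assumptions, not measurements.

```python
def slow_query_probability(p_slow: float, n_ssds: int) -> float:
    """Chance that a query waiting on n_ssds devices sees at least one slow reply.

    Assumes (hypothetically) that every SSD is slow with the same
    probability p_slow and that the SSDs behave independently.
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be between 0 and 1")
    if n_ssds < 1:
        raise ValueError("n_ssds must be at least 1")
    # The query is fast only if every single SSD is fast.
    return 1.0 - (1.0 - p_slow) ** n_ssds


# Illustrative figure: each SSD is slow on 1% of requests.
for n in (1, 10, 100):
    print(n, round(slow_query_probability(0.01, n), 3))
```

Under these assumptions, a 1% outlier rate per SSD becomes roughly a 10% chance of a slow query across 10 SSDs, and roughly 63% across 100. This is why outliers that look negligible on a single drive matter in a cluster.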

NVMe v1.4 as the noisy neighbour and long tail latency remedy 

NVMe v1.4 was released in July 2019 and focuses on cloud/hyperscale features. 

NVM Sets serve to isolate noisy neighbours by separating and allocating NAND media so that workloads (or containers) using one NVM Set do not impact workloads on other sets.

In the diagram below, NVM Set A is separate and isolated from NVM Set B. NVM Set A consists of physical die ‘NS A1’, physical die ‘NS A2’ and physical die ‘NS A3’.

NVM Sets isolate noisy neighbours
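The isolation idea can be illustrated with a toy queueing model. This is only a sketch: it does not call any NVMe interface, and the function name, the set labels and the millisecond figures are all hypothetical. It assumes a read waits behind whatever work is already queued on the media it shares.

```python
from typing import Dict


def read_wait_ms(pending_ms_by_set: Dict[str, int],
                 workload_set: str,
                 isolated: bool = True) -> int:
    """Toy estimate of how long a new read waits behind queued work.

    pending_ms_by_set maps a (hypothetical) set label to the milliseconds
    of work already queued on that set's media.
    """
    if isolated:
        # With separate sets, only work queued on the reader's own set counts.
        return pending_ms_by_set.get(workload_set, 0)
    # Without sets, all workloads share the media, so every queue is in the way.
    return sum(pending_ms_by_set.values())


# Assumed scenario: set A has 40 ms of queued writes, set B is idle.
pending = {"A": 40, "B": 0}
print(read_wait_ms(pending, "B", isolated=True))   # 0
print(read_wait_ms(pending, "B", isolated=False))  # 40
```

In this model the reader on set B pays nothing for the write burst on set A when the sets are isolated, and pays for all of it when the media is shared.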

NVMe IO Determinism eliminates read latency outliers caused by SSD housekeeping.

A chunk of time (shown below in green) is allocated to deliver predictable read latency. This is the deterministic mode. Another chunk of time (shown in red) is allocated for housekeeping and read latency is then unpredictable. This is non-deterministic mode.

NVMe deterministic IO gets interesting when it is applied to multiple SSDs, with IO determinism coordinated across the group. SSDs in deterministic mode are employed while SSDs in non-deterministic mode are conveniently omitted from service. This remedies the unpredictable read latency problem.
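One way to picture the coordination is a staggered schedule, sketched below as a toy model. It assumes every SSD alternates between a fixed deterministic window and a fixed non-deterministic window, and that the host knows each SSD's offset. The class, its field names and the window lengths are hypothetical, and nothing here is an NVMe command.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowSchedule:
    """Hypothetical repeating schedule for one SSD (all times in ms)."""
    name: str
    deterministic_ms: int
    non_deterministic_ms: int
    offset_ms: int  # when this SSD's first deterministic window starts

    def is_deterministic(self, t_ms: int) -> bool:
        period = self.deterministic_ms + self.non_deterministic_ms
        # Position inside the repeating cycle; Python's % stays non-negative.
        phase = (t_ms - self.offset_ms) % period
        return phase < self.deterministic_ms


def readable_ssds(group: List[WindowSchedule], t_ms: int) -> List[str]:
    """Names of the SSDs that are in deterministic mode at time t_ms."""
    return [s.name for s in group if s.is_deterministic(t_ms)]


# Assumed figures: 200 ms deterministic, 100 ms housekeeping, staggered by 100 ms.
group = [WindowSchedule(f"ssd{i}", 200, 100, i * 100) for i in range(3)]
for t in (0, 100, 200):
    print(t, readable_ssds(group, t))
```

With these assumed figures exactly one SSD is doing housekeeping at any moment, so two of the three are always available for predictable reads. The model presumes the data is replicated across the group so that a read can be sent to whichever SSD is currently deterministic.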

Don’t believe me … believe Facebook: its Flash Memory Summit 2018 presentation sets out NVM Sets as the solution for inconsistent latency and the route to consistent quality of service.

The NVM Express 1.4 specification can be found here.

Short term: there are worse things than over-engineering. Simply spreading the workload over more servers with more SSDs is reasonable. Also arrange for additional temporary servers and SSDs during peak times.

Longer term: the more elegant, more affordable and more lasting remedy for noisy neighbours and unpredictable latency is NVMe v1.4.

Note: Consultant and patent holder Hubbert Smith (Linkedin) has held senior product and marketing roles with Toshiba Memory America, Samsung Semiconductor, NetApp, Western Digital and Intel. He is a published author and a past board member and workgroup chair of the Storage Networking Industry Association and has had a significant role in changing the storage industry with data centre SSDs and Enterprise SATA disk drives.