Pavilion Data adds drive rebuild swarms, extra checksums and standby controllers

Pavilion Data has added three availability and resilience features to its all-flash array products to guard against controller and SSD failures.

The company makes parallel, multi-controller, NVMe-oF-accessed all-flash arrays offering high and scalable performance. The three data assurance features are aimed at improving the arrays' availability and data resiliency.

V.R. Satish, Pavilion co-founder and CTO, issued a statement saying: “As big and fast data applications become business critical, Pavilion allows customers to manage their modern workloads with the same SLAs as traditional workloads with its new features.”

The first new feature is controller redundancy. With up to 20 active:active controllers, Pavilion has always offered high availability but, should a controller fail, the overall available controller resource shrinks.

Now there is N+1 controller redundancy, which adds controllers operating in standby mode: an active:passive scheme. The result is that up to 16 controllers per 4RU array can be performing I/O at any given time, with up to 4 additional controllers on standby. Should a controller fail, there is no loss of performance.
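The standby idea can be illustrated with a toy model. This is a minimal sketch, not Pavilion's implementation: the `ControllerPool` class, its `fail` method and the controller names are all hypothetical, and it assumes a standby controller is promoted as soon as an active one fails.

```python
class ControllerPool:
    """Toy model of active controllers backed by standby spares (hypothetical)."""

    def __init__(self, active, standby):
        self.active = list(active)
        self.standby = list(standby)

    def fail(self, controller):
        """Remove a failed active controller and promote a standby if one exists."""
        self.active.remove(controller)  # raises ValueError if it is not active
        if self.standby:
            self.active.append(self.standby.pop(0))


# 16 active controllers plus 4 on standby, the maximum quoted for a 4RU array.
pool = ControllerPool(
    active=[f"active-{i}" for i in range(16)],
    standby=[f"standby-{i}" for i in range(4)],
)
pool.fail("active-3")
print(len(pool.active), len(pool.standby))  # 16 3
```

In this model the number of controllers doing I/O only starts to shrink once every standby has been used up.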

Second is swarm rebuild. When an SSD fails, its contents are rebuilt using Pavilion’s RAID 6 (dual parity) scheme. That work is now spread across multiple controllers, a swarm, to speed up rebuilds to a 4TB/hour rate. Pavilion looked at erasure coding as an alternative drive-level protection scheme but says the amount of extra storage capacity needed is unacceptable.

For comparison, an 8TB SATA SSD rebuild time of 4.4 hours was quoted by Anandtech in March last year. That’s 1.8TB/hour.
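The arithmetic behind that comparison can be checked in a few lines. This is a back-of-the-envelope sketch: the function names are illustrative, and the second calculation assumes, hypothetically, that an 8TB drive is rebuilt at a sustained 4TB/hour, which the source does not state.

```python
def rebuild_rate_tb_per_hour(capacity_tb, hours):
    """Average rebuild rate for a drive of the given capacity."""
    return capacity_tb / hours


def rebuild_hours(capacity_tb, rate_tb_per_hour):
    """Time to rebuild a drive at a sustained rate."""
    return capacity_tb / rate_tb_per_hour


# Quoted SATA SSD figure: 8TB rebuilt in 4.4 hours.
print(round(rebuild_rate_tb_per_hour(8, 4.4), 1))  # 1.8

# Assumption: the same 8TB rebuilt at a sustained 4TB/hour.
print(rebuild_hours(8, 4.0))  # 2.0
```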

A third new feature deals with internal data write failures in SSDs. These can happen when the drive tells the system that a write has completed but the actual data block is not written. Pavilion says this is a rare but statistically significant error. Its software now adds a version checksum number to every 4KB block, in addition to the standard cyclic redundancy check (CRC) that is done as part of a T10 Data Integrity Field (DIF) operation.

With standard T10 DIF, a CRC is computed when data is written and compared on read to confirm that the data is valid. But if a later write to that block was acknowledged and never actually took place, the CRC will not catch it, because the old data still matches its old CRC. Pavilion adds an additional checksum to confirm that the data is both valid and current.
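The difference between "valid" and "current" can be shown with a toy model. This is a minimal sketch under stated assumptions, not Pavilion's design: `FlakyDrive`, `Array`, `StoredBlock` and `StaleBlockError` are hypothetical names, the version tag is assumed to be a simple counter that is also tracked in metadata outside the drive, and `zlib.crc32` stands in for the 16-bit guard CRC that T10 DIF actually uses.

```python
import zlib
from dataclasses import dataclass


class StaleBlockError(Exception):
    """Block passes its CRC but is not the latest version."""


@dataclass
class StoredBlock:
    data: bytes
    crc: int      # stand-in for the T10 DIF guard CRC
    version: int  # hypothetical per-block version tag


class FlakyDrive:
    """Toy drive that can acknowledge a write without persisting it."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False

    def write(self, lba, block):
        if self.drop_next_write:
            self.drop_next_write = False
            return  # acknowledged, but never persisted
        self.blocks[lba] = block

    def read(self, lba):
        return self.blocks[lba]


class Array:
    def __init__(self, drive):
        self.drive = drive
        self.expected_version = {}  # assumed metadata held outside the drive

    def write(self, lba, data):
        version = self.expected_version.get(lba, 0) + 1
        self.drive.write(lba, StoredBlock(data, zlib.crc32(data), version))
        self.expected_version[lba] = version

    def read(self, lba):
        block = self.drive.read(lba)
        if zlib.crc32(block.data) != block.crc:
            raise ValueError("corrupt block: CRC mismatch")
        if block.version != self.expected_version[lba]:
            raise StaleBlockError(f"block {lba} is valid but not current")
        return block.data


drive = FlakyDrive()
array = Array(drive)
array.write(0, b"old contents")
drive.drop_next_write = True
array.write(0, b"new contents")  # the drive silently drops this write
try:
    array.read(0)
except StaleBlockError as err:
    print("detected:", err)
```

The stale block still passes its CRC check, since the old data and old CRC are intact. Only the version comparison reveals that the acknowledged write never reached the media.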

Satish said: “Customers can now treat their modern workloads as Tier 1 applications when it comes to data resiliency and availability.” Put another way, customers can view Pavilion arrays as equivalent, in data resiliency and availability terms, to the legacy high-end arrays they have used for mission-critical applications and data.