Updated 28 March 2020.
HPE has told customers that four SSD models used in its servers and storage systems may fail, with data loss, after 40,000 hours of operation.
The company said in a bulletin that the “issue is not unique to HPE and potentially affects all customers that purchased these drives.”
HPE issued a statement on 28 March 2020, which said: “HPE was notified by Western Digital of a firmware issue in a specific line of older end-of-life SanDisk SAS solid state drive (SSD) models used by select OEM customers. The defect causes drive failure after 40,000 hours of operation; no HPE customers are in danger of immediate failure. HPE is actively reaching out to impacted customers to provide updated firmware that addresses the issue.”
A Dell EMC urgent firmware update issued last month also mentioned SSDs failing after 40,000 operating hours and specifically identified SanDisk SAS drives. The update included firmware version D417 as a fix.
The fault fixed by the Dell EMC firmware concerns an Assert function with a bad check on a circular buffer’s index value: instead of allowing a maximum value of N, it checked against N-1. The fix corrects the assert check to use N as the maximum.
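The off-by-one can be sketched as follows. The firmware source is not public, so the buffer size and function names here are hypothetical; only the N versus N-1 check mirrors the bulletin’s description:

```python
BUFFER_SLOTS = 8  # hypothetical circular-buffer size; "N" in the bulletin


def check_index_buggy(index, n=BUFFER_SLOTS):
    """Buggy assert: treats N-1 as the maximum valid index value,
    so a legitimate value of N trips the assert and fails the drive."""
    assert index <= n - 1, "assert fires: drive reports failure"


def check_index_fixed(index, n=BUFFER_SLOTS):
    """Fixed assert (as in firmware D417/HPD7): the maximum valid value is N."""
    assert index <= n, "index genuinely out of range"
```

With this sketch, `check_index_buggy(BUFFER_SLOTS)` raises an AssertionError on a value the fixed check accepts, which is how a routine bookkeeping value could brick an otherwise healthy drive at a fixed point in its lifetime.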
Blocks & Files asked Western Digital, which acquired SanDisk in 2016, for comment. A company spokesperson said: “Per Western Digital corporate policy, we are unable to provide comments regarding other vendors’ products. As this falls within HPE’s portfolio, all related product questions would best be addressed with HPE directly.”
For HPE customers, SSD firmware version HPD7 is available to remedy the affected drives, which are:
- EK0888JVYPN – HPE 800GB 12G SAS WI-1 SFF SC SSD – WI meaning write-intensive
- EO1600JVYPP – HPE 1.6TB 12G SAS WI-1 SFF SC SSD
- MK0800JVYPQ – HPE 800GB 12G SAS MU-1 SFF SC SSD – MU meaning mixed use
- MO1600JVYPR – HPE 1.6TB 12G SAS MU-1 SFF SC SSD
A failed drive means data loss and recovery from backup, unless the drives are arranged in a RAID scheme that protects against drive failure. Since affected drives powered on at the same time will hit the 40,000-hour mark at roughly the same time, a RAID set containing more than one affected drive should be considered at risk as a whole.
Forty thousand hours is equivalent to four years, 206 days, 16 hours. This implies that the first affected drives were switched on in late 2015.
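The conversion can be checked with a quick calculation (using a 365-day year, as the article’s figure implies):

```python
TOTAL_HOURS = 40_000

days, hours = divmod(TOTAL_HOURS, 24)  # 1666 days, 16 hours
years, days = divmod(days, 365)        # 4 years, 206 days

print(f"{years} years, {days} days, {hours} hours")
# 4 years, 206 days, 16 hours
```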