Backblaze drive type profiling predicts replacement and migration

Cloud storage provider Backblaze is developing drive type profiling models to optimize its drive replacement and migration strategies.

The work is detailed in the latest edition of its quarterly disk drive annualized failure rate (AFR) statistics blog. Author and principal cloud storage storyteller Andy Klein says: “One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time.”

He started by “plotting the current lifetime AFR for the 14 drive models that have an average age of 60 months or less,” drawing a chart of drive average age versus cumulative AFR and dividing it into four quadrants:

We can instantly see that the two left-hand quadrants hold most of the drives, and the top right quadrant – for older drives with higher cumulative AFRs – has only two drive models in it.
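For readers who want to reproduce figures like these, AFR is conventionally computed from drive days and failures rather than from raw drive counts. The following is a minimal Python sketch of that calculation; the function name and the example numbers are illustrative assumptions, not taken from Klein's post.

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """Return AFR as a percentage: failures per drive-year of service, times 100."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Made-up example: 10 failures over 365,000 drive days (1,000 drive-years).
example_afr = annualized_failure_rate(failures=10, drive_days=365_000)
print(f"{example_afr:.2f}%")
```

Working in drive days means a model that was only partly deployed during the period is weighted by the time its drives actually spent in service.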

Klein characterizes the quadrants as:

  1. Older drives doing well – with ones to the right having higher AFRs;
  2. Drives with AFRs above 1.5 percent, at around 2 percent – “What is important is that AFR does not increase significantly over time”;
  3. The empty quadrant – it would be populated if any of Backblaze’s drives exhibited a bathtub-curve failure pattern, with failures in their early days, a reliable mid-period, and subsequent failures as they age;
  4. Younger drives – with low failure rates.
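The four quadrants amount to a classification on two thresholds. The sketch below is one rough way to express it in Python. The 1.5 percent AFR divider is suggested by the second item in the list, while the 30-month age divider is purely a hypothetical placeholder; neither is confirmed as the exact boundary on Backblaze's chart.

```python
def quadrant(avg_age_months: float, afr_percent: float,
             age_split: float = 30.0, afr_split: float = 1.5) -> int:
    """Return the quadrant number (1-4) for a drive model.

    age_split and afr_split are assumed dividers, not Backblaze's published values.
    """
    older = avg_age_months >= age_split
    high_afr = afr_percent > afr_split
    if older and not high_afr:
        return 1  # older drives doing well
    if older and high_afr:
        return 2  # older drives with higher AFRs
    if high_afr:
        return 3  # younger drives failing early (empty in Klein's chart)
    return 4      # younger drives with low failure rates


# Hypothetical drive model: 48 months average age, 0.9 percent lifetime AFR.
print(quadrant(48, 0.9))
```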

Next Klein drew a similar chart for drives older than 60 months: 

Now there is a more even distribution of drives across the four quadrants. He says: “As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.” The 4TB Seagate drive (ST4000DM000) in Quadrant II looks “first in line for the CVT migration process.” CVT stands for Backblaze’s internal Cluster, Vault, Tome migration process. [See bootnote.]

Klein next looked at the change in failure rates for these drives over time in a so-called snake chart:

This chart starts at the 24-month age point and it shows that “the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months,” except for the black line – Seagate’s ST4000DM000 4TB model.
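A line on a chart like this can be thought of as a running AFR recomputed at each age point. The Python sketch below assumes each point is a cumulative (lifetime-to-date) AFR built from per-month drive days and failures; the function name and the input layout are hypothetical, and the article does not spell out exactly how Backblaze builds the chart.

```python
from itertools import accumulate


def cumulative_afr(monthly):
    """Return the lifetime-to-date AFR (percent) after each month of age.

    monthly: ordered list of (drive_days, failures) tuples, one per month of age.
    """
    total_days = accumulate(days for days, _ in monthly)
    total_failures = accumulate(fails for _, fails in monthly)
    return [
        fails / (days / 365) * 100 if days else 0.0
        for days, fails in zip(total_days, total_failures)
    ]


# Made-up data: one failure in the first month, none in the second.
curve = cumulative_afr([(36_500, 1), (36_500, 0)])
print(curve)
```

Because every point includes all earlier drive days, a single bad month moves the line less and less as a model ages, which is one reason such curves tend to settle.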

Five drives are in Quadrant I. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines, “indicating their failure rates have been consistent over time, especially after 60 months of service.”

Two drives exhibit increasing failure rates with age – 8TB Seagate (blue line) and the 8TB HGST (gray line) – but both are now levelling out.

Four drives are in Quadrant II. Three of them – the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) – show accelerated failure rates over time. Klein writes: “All three models will be closely watched and replaced if this trend continues.”

The 4TB Seagate drive (ST4000DM000, black line) “is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process.”

Looking at all these curves, Klein regards the 8TB Seagate (ST8000DM002) as normal: it held a 1 percent AFR up to the 60-month point and then, as expected, its AFR rose towards 1.5 percent.

He says the two 4TB HGST drive models (brown and purple lines) have “failure rates … well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.”

Klein believes that using Gen AI large language models (LLMs) to predict drive failure rates is a no-go area for now. Training a model on one drive type’s failure profile doesn’t mean the model can predict another drive type’s failure profile. He observes: “One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different.”

Backblaze’s drive set data is freely available here. Klein points out that anyone can use it, but: “All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, 3) you may sell derivative works based on the data, but 4) you can not sell this data to anyone; it is free.”
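As an illustration of working with that data, the sketch below tallies drive days and failures per model and turns them into an AFR. It assumes the daily CSV layout of one row per drive per day with `model` and `failure` (0 or 1) columns; the sample rows, model names, and function name are made up for the example.

```python
import csv
import io
from collections import defaultdict


def afr_by_model(rows):
    """Return {model: AFR percent}, treating each row as one drive day."""
    drive_days = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        drive_days[row["model"]] += 1
        failures[row["model"]] += int(row["failure"])
    return {m: failures[m] / (drive_days[m] / 365) * 100 for m in drive_days}


# Tiny made-up sample standing in for the real daily files.
SAMPLE = """date,serial_number,model,failure
2024-01-01,A1,MODEL_X,0
2024-01-01,A2,MODEL_X,0
2024-01-02,A1,MODEL_X,0
2024-01-02,A2,MODEL_X,1
2024-01-01,B1,MODEL_Y,0
2024-01-02,B1,MODEL_Y,0
"""

result = afr_by_model(csv.DictReader(io.StringIO(SAMPLE)))
print(result)
```

With only a handful of drive days the numbers are meaningless, of course; the point is the shape of the aggregation, which would run unchanged over rows read from real files.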

Bootnote

A Tome is a logical collection of 20 drives, “with each drive being in one of the 20 storage servers in a given Vault.” A storage server could possess 60 HDDs and hence 60 unique tomes in the vault. A Cluster is a logical collection of Vaults, which can have any combination of vault sizes.
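The arithmetic can be sketched in Python as follows. The sketch assumes that a tome is formed from the drive in the same slot position on each of the 20 servers; the bootnote only says each tome has one drive per server, so the slot-matching rule and all names here are illustrative assumptions.

```python
SERVERS_PER_VAULT = 20   # from the bootnote
DRIVES_PER_SERVER = 60   # from the bootnote's example


def build_tomes(servers=SERVERS_PER_VAULT, drives_per_server=DRIVES_PER_SERVER):
    """Return a list of tomes, each a list of (server_index, slot_index) pairs.

    Assumes tome k takes slot k from every server (a hypothetical layout).
    """
    return [
        [(server, slot) for server in range(servers)]
        for slot in range(drives_per_server)
    ]


tomes = build_tomes()
print(len(tomes), "tomes of", len(tomes[0]), "drives each")
```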