High Danger of Defect: Machine learning model predicts potential disk failures in Google’s DCs

Google has devised a machine learning (ML) model that predicts disk failures with 98 per cent precision. The idea is to reduce data recovery work when disks actually fail.

According to a Google blog by technical program manager Nitin Agarwal and AI engineer Rostam Dinyari, Google has millions of hard disk drives (HDDs) under management, some of which fail. “Any misses in identifying these failures at the right time can potentially cause serious outages across our many products and services.”

When a disk in Google’s data centres encounters non-fatal problems, short of an actual crash, its data is read off (drained) and the drive is disconnected from production use. Diagnostics are run, the drive is fixed, and it is returned to production. Google wanted a way to predict such disk failures accurately to avoid the time-suck.

It worked with its main HDD provider, Seagate, and consultants from Accenture to build a disk failure prediction machine learning model. This was based on its two most common drive types, using Google Cloud. The aim was to predict the probability of a recurring failing disk – a disk that fails or has experienced three or more problems in 30 days.
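The “recurring failing disk” definition can be pictured as a simple labelling rule. The following is a minimal sketch only, not Google’s code: the function name, its arguments and the exact window boundaries are assumptions, and only the 30-day window and the three-problem threshold come from the description above.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30   # window length quoted in the article
MIN_PROBLEMS = 3   # problem threshold quoted in the article


def is_recurring_failing(problem_dates, failed, as_of):
    """Hypothetical labelling rule: True if the disk failed outright, or
    logged at least MIN_PROBLEMS problems in the WINDOW_DAYS days ending
    on as_of (inclusive)."""
    if failed:
        return True
    window_start = as_of - timedelta(days=WINDOW_DAYS)
    recent = [d for d in problem_dates if window_start < d <= as_of]
    return len(recent) >= MIN_PROBLEMS


# Three problems inside the window, no outright failure.
problems = [date(2021, 6, 5), date(2021, 6, 12), date(2021, 6, 28)]
label = is_recurring_failing(problems, failed=False, as_of=date(2021, 6, 30))
```

A label of this kind would be the target the model is trained to predict; how Google actually counts “problems” is not described in the blog summary.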

For the model to work, it needs feeding with data. As well as building the ML model, the engineers set up an automated data pathway through which HDD telemetry data could travel to a vault and be fed to the models.

The source HDD data ran to terabytes, with hundreds of parameters drawn from billions of rows of drive-level SMART (Self-Monitoring, Analysis and Reporting Technology) data read in every hour. This was accompanied by host systems data, repair logs, Online Vendor Diagnostics (OVD) or Field Accessible Reliability Metrics (FARM) logs, and manufacturing data about each disk drive.

Google’s data pipeline for its machine learning disk failure production model

How it works

A DevOps-style approach, called MLOps, was used to join the data pipeline to the model and keep it fed with data. It used Google Cloud services such as AutoML Tables, BigQuery and Dataflow, along with TensorFlow and HashiCorp’s Terraform. Two models were devised: an AutoML Tables classifier and a custom deep Transformer-based model using TensorFlow.

The AutoML Tables design used aggregates of time-series features, such as the minimum, maximum, and average read error rates for a disk. These were concatenated with features that were not time-series, such as drive model type.
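To make the aggregation step concrete, here is a minimal sketch of how one disk’s readings might be flattened into a single tabular row. It is illustrative only: the function name, the column names and the sample values are all assumptions, and the real pipeline ran on BigQuery and Dataflow rather than in plain Python.

```python
from statistics import mean


def build_feature_row(read_error_rates, static_features):
    """Collapse one disk's time series into min/max/mean aggregates and
    append its non-time-series attributes, giving one flat row."""
    if not read_error_rates:
        raise ValueError("need at least one reading")
    row = {
        "read_error_rate_min": min(read_error_rates),
        "read_error_rate_max": max(read_error_rates),
        "read_error_rate_mean": mean(read_error_rates),
    }
    # Concatenate the static (non-time-series) features, e.g. drive model.
    row.update(static_features)
    return row


# Hypothetical readings and drive model name, for illustration only.
row = build_feature_row([2, 4, 6], {"drive_model": "model_a"})
```

The trade-off in this design is that a tabular classifier gets a fixed-width input regardless of how many readings a disk has produced, at the cost of discarding the ordering of those readings.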

The alternative Transformer model takes raw time-series data directly. Non-time-series data feeds into a deep neural network, and the output of this is concatenated with the Transformer’s output and used to predict the likelihood of failure.

Once the models are deployed, their predictions are stored and then compared with actual drive repair logs after 30 days. The AutoML model achieved a precision of 98 per cent … compared to a precision of 70-80 per cent … from the custom Transformer/deep neural network design.
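Precision here means the share of disks flagged by the model that went on to show up in the repair logs; it is not the same as overall accuracy. As a rough sketch of the comparison step, assuming predictions and repair-log entries are both reduced to sets of disk IDs (the function name and the sample IDs below are hypothetical):

```python
def precision_recall(predicted_failing, actually_failed):
    """Compare two sets of disk IDs: those the model flagged, and those
    found in the repair logs for the evaluation window."""
    true_positives = len(predicted_failing & actually_failed)
    precision = true_positives / len(predicted_failing) if predicted_failing else 0.0
    recall = true_positives / len(actually_failed) if actually_failed else 0.0
    return precision, recall


# Made-up IDs: 4 disks flagged, 6 disks repaired, 3 in common.
predicted = {"d1", "d2", "d3", "d4"}
repaired = {"d1", "d2", "d3", "d9", "d10", "d11"}
precision, recall = precision_recall(predicted, repaired)
```

A model can score high precision while still missing many failing disks, which is why recall is usually reported alongside it.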

The blog authors say: “We were also able to explain the model by identifying the top reasons behind the recurring failures and enabling ground teams to take proactive actions to reduce failures in operations before they happened.”

Elias Glavinas, Seagate’s director of Quality Data Analytics, Tools & Automation, was quoted as saying: “AutoML Tables, specifically, proved to be a substantial time and resource saver on the data science side … with model prediction results that matched or exceeded our data scientists’ manual efforts. Add to that the capability for easy and automated model retraining and deployment, and this turned out to be a very successful project.”

The engineers close by saying: “The business case for using an ML-based system to predict HDD failure is only getting stronger. When engineers have a larger window to identify failing disks, not only can they reduce costs but they can also prevent problems before they impact end users. We already have plans to expand the system to support all Seagate drives.”