Facebook investigates silent data corruption

Facebook has flagged up the problem of silent data corruption (SDC) by CPUs, which can cause application failures and undermine data management systems in its super-dense datacenters.

An academic paper detailing the research says: “It is our observation that computations are not always accurate. In some cases, the CPU can perform computations incorrectly.” Unlike soft errors due to radiation or other interference, “silent data corruptions can occur due to device characteristics and are repeatable at scale. We observe that these failures are reproducible and not transient.”

Moreover, the researchers say: “CPU SDCs are evaluated to be a one in a million occurrence within fault injection studies.” Scale that up by the number of processors and computations Facebook’s infrastructure accommodates and the implications are obvious. The researchers state it “can result in data loss and can require months of debug engineering time.”
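A back-of-the-envelope sketch shows why scale matters. It assumes, purely for illustration, that "one in a million" can be read as an independent per-unit probability (per device or per computation; the quoted line does not say which). The fleet size is hypothetical, not a Facebook figure.

```python
import math

# "One in a million", read here as an independent per-unit probability.
# This reading is an assumption made for illustration only.
SDC_RATE = 1e-6


def expected_sdc_events(units: int, rate: float = SDC_RATE) -> float:
    """Expected number of affected units among `units` independent trials."""
    return units * rate


def prob_at_least_one(units: int, rate: float = SDC_RATE) -> float:
    """Probability that at least one of `units` independent trials is affected."""
    # Equivalent to 1 - (1 - rate) ** units, computed with log1p/expm1
    # so that tiny rates do not lose precision.
    return -math.expm1(units * math.log1p(-rate))


if __name__ == "__main__":
    fleet = 1_000_000  # hypothetical unit count, not a figure from the paper
    print("expected events:", expected_sdc_events(fleet))
    print("P(at least one):", prob_at_least_one(fleet))
```

Under these assumptions, a rate that looks negligible for one machine becomes something close to a certainty once the unit count reaches the millions.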

The researchers wrote: “It is our observation that increased density and wider datapaths increase the probability of silent errors. This is not limited to CPUs and is applicable to special function accelerators and other devices with wide datapaths.”

The effects can reach application level, including the data compression technology Facebook uses to reduce the footprint of its datastores. A silent miscalculation there can leave files missing from databases, ultimately leading to application failure. “Eventually the querying infrastructure reports critical data loss after decompression,” said the researchers.
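One generic way to make this kind of loss visible rather than silent is to store a digest of the original data alongside the compressed blob and check it after decompression. The sketch below uses Python's standard zlib and hashlib modules. It is an illustration of the idea, not Facebook's pipeline, and the function names are hypothetical.

```python
import hashlib
import zlib
from typing import Tuple


def compress_with_digest(data: bytes) -> Tuple[bytes, str]:
    """Compress `data` and return the blob plus a SHA-256 digest of the original."""
    digest = hashlib.sha256(data).hexdigest()
    return zlib.compress(data), digest


def verify_roundtrip(blob: bytes, digest: str) -> bool:
    """Decompress `blob` and confirm the result matches the stored digest."""
    try:
        restored = zlib.decompress(blob)
    except zlib.error:
        # A damaged stream fails zlib's own integrity check.
        return False
    return hashlib.sha256(restored).hexdigest() == digest


if __name__ == "__main__":
    blob, digest = compress_with_digest(b"example record " * 100)
    print("round trip ok:", verify_roundtrip(blob, digest))
```

A check like this only helps if the digest and the verification are not computed by the same faulty core, which is part of why the researchers argue for fleet-wide testing.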

The answer is testing to uncover errors. But servers are tested for only a few hours by the vendor, then by an integrator for a couple of days at best. After they go into production, Facebook said, “it becomes really challenging to implement testing at scale.”

Testing times

The giant’s answer is to combine two types of testing regime. Opportunistic testing means piggybacking on routine maintenance events such as reboots, kernel or firmware upgrades, device reimages, host provisioning, and namespace reprovisioning. “We implement this mechanism using Fleetscanner, an internal facilitator tool for testing SDCs opportunistically,” said Facebook.

But this alone is not enough. Facebook also committed to ripple testing, essentially running SDC detection in conjunction with production workloads. “Ripple tests are typically in the order of hundreds of milliseconds within the fleet,” the researchers wrote. This imposes “a footprint tax”, but one that is negligible in comparison with other management activities.
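The general shape of a short, time-boxed check can be sketched as follows. The article does not describe how Facebook's tests work internally, so everything here is an assumption: the workload, the function names, and the 200 millisecond budget are hypothetical. The idea is simply to repeat a deterministic computation for a fixed budget and compare each result with a reference value obtained on known-good hardware.

```python
import time
from typing import Tuple


def checksum_workload(n: int = 10_000) -> int:
    """Deterministic integer workload; any fixed, repeatable computation would do."""
    acc = 0
    for i in range(1, n + 1):
        acc = (acc * 31 + i * i) % 1_000_000_007
    return acc


def ripple_test(golden: int, budget_s: float = 0.2) -> Tuple[bool, int]:
    """Rerun the workload until the time budget expires.

    Returns (passed, runs). `golden` is assumed to come from known-good hardware.
    """
    deadline = time.monotonic() + budget_s
    runs = 0
    while True:
        runs += 1
        if checksum_workload() != golden:
            return False, runs  # mismatch: flag this host for deeper testing
        if time.monotonic() >= deadline:
            return True, runs


if __name__ == "__main__":
    # In this demo the reference is computed locally, so the test can only
    # catch inconsistency between runs, not a consistently wrong result.
    reference = checksum_workload()
    print(ripple_test(reference))
```

Keeping the budget to a fraction of a second is what lets a check of this kind sit alongside production work, at the cost of covering only a small slice of the processor's behaviour on each pass.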

The result is faster detection of errors that “can have a cascading effect on applications… As a result, detecting these at scale as quickly as possible is an infrastructure priority toward enabling a safer, reliable fleet.”