ECC

ECC – Error Checking and Correction (the abbreviation is also commonly expanded as Error-Correcting Code). ECC is a mechanism for detecting and correcting errors in memory data caused by environmental interference and physical defects. Most memory errors are single-bit errors caused by soft errors (e.g. cosmic rays, alpha particles, electromagnetic interference), but some are due to hardware effects (e.g. row hammer). Single-bit errors can be corrected by ECC memory systems. Multi-bit errors may also be detected and/or corrected, depending on the code used and the number of symbols in error.

ECC is implemented by generating and storing an encoded, parity-like check code that is used not only to identify the bit in error but also to correct it. This implementation-dependent code is generated and stored on writes and verified on reads. The most common implementations use Hamming codes extended with an overall parity bit, which gives single-error correction and double-error detection (SECDED). Hamming codes define parity bits that each cover a pre-defined set of data bits. Typically, an 8-bit check code (7 Hamming bits plus 1 overall parity bit) protects 64 bits of data. A plain Hamming code, without the extra parity bit, can either detect one-bit and two-bit errors or correct one-bit errors, but not both: when it is used for correction, a two-bit error is miscorrected as if it were a one-bit error.
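The generate-on-write, verify-on-read cycle can be illustrated with a scaled-down sketch. The example below is not the 72-bit code used by real memory controllers; it assumes a 4-bit data word protected by a Hamming(7,4) code plus one overall parity bit, and the function names encode and decode are hypothetical. The principle is the same at any width: a non-zero syndrome with odd overall parity points at a single flipped bit, while a non-zero syndrome with even overall parity signals a double-bit error that can be detected but not corrected.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword (the 'write' side)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant data bit
    code = [0] * 8                             # index 0 holds the overall parity bit
    # Data bits go in positions 3, 5, 6, 7; positions 1, 2, 4 hold Hamming parity.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # makes the total number of 1s even
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Check an 8-bit codeword (the 'read' side). Returns (data, status)."""
    code = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the positions of all set bits in positions 1..7.
    # It is 0 for a valid codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) % 2 == 1           # True when overall parity is violated

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # An odd number of bits flipped: treat it as a single-bit error.
        # syndrome == 0 here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status
```

For instance, decode(encode(0b1011)) returns the original data with status "ok"; flipping any one bit of the stored word still yields the original data with status "corrected"; flipping any two bits yields status "uncorrectable".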

ECC memory is a type of DRAM used mainly in workstations and servers. A typical ECC module differs from a non-ECC module in that it has nine memory chips per rank instead of the usual eight: the ninth chip stores the check bits used to detect and correct errors in the data held by the other eight.
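The nine-chip layout follows from a short calculation, sketched below under two assumptions: each chip contributes 8 bits to a 64-bit data word, and the code is a Hamming code with one added overall parity bit. A Hamming code over m data bits needs the smallest r with 2^r >= m + r + 1 check bits, because the syndrome must be able to name every bit position plus the "no error" case. The helper name secded_check_bits is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for single-error correction plus double-error detection."""
    r = 0
    # Smallest r such that 2**r >= data_bits + r + 1 (Hamming bound for one error).
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one more bit for the overall parity used in double-error detection


data_width = 64
check_width = secded_check_bits(data_width)
print(check_width)               # 8
print(data_width + check_width)  # 72
```

With 8 check bits per 64-bit word, the stored word is 72 bits wide, which is exactly what one additional 8-bit-wide chip provides alongside the eight data chips.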