Author: Alexander Eichhorn

Storage reliability

Storage technology is not flawless.

Especially in a professional environment it is important to know its limits and, depending on the use case, to mitigate the risk of losing data to drive failures and silent data corruption. Different technologies bear different inherent risks of data corruption or complete loss. Unless overwritten, the data on a failed drive, even after a fire, can often be reconstructed by specialized data recovery companies, but prices for such services are exorbitant.

In practice the reliability of disk drives and other storage media depends on many factors. The most important ones emerge from the physical environment in which a drive is used and from its operational conditions. High temperature, strong vibration and shock, frequent start/stop cycles and an unstable power supply are bad for all media, in particular for HDDs and SSDs. For SSDs the number of write cycles and the read/write patterns play an important role as well. For optical media, in particular media based on inorganic materials (DVD, Blu-ray), physical handling also matters: the bare discs should not be touched by human operators, to avoid fungal growth and corrosion from skin acids.

Silent Data Corruption

Any computer component that handles bits of data is subject to hardware and software faults. It could be the main memory, a network interface card, a hard disk, the internal bus, the file system, or the applications that copy and back up data. Silent data corruption happens all the time. It can impact every file and every piece of information that is actively processed or at rest on storage media. The reasons are hardware errors, firmware and software bugs, electromagnetic interference, signal noise, power fluctuations and more.

Many components in a computer already provide some protection. Magnetically stored data, for example, is protected by error-correction codes that add redundancy to correct single-bit errors. Parity memory adds an extra parity bit per byte to at least detect corruption, and ECC memory adds enough redundant bits to also correct single-bit errors. File systems keep redundant index structures on disk.
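To make the parity idea concrete, here is a minimal sketch of how a single even-parity bit detects (but cannot correct or locate) a one-bit flip. Real memory controllers implement this in hardware and ECC uses more elaborate Hamming-style codes; this toy example only illustrates the detection principle.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte).count("1") % 2

def detect_flip(byte: int, stored_parity: int) -> bool:
    """True if the byte no longer matches the parity recorded at write time."""
    return parity_bit(byte) != stored_parity

original = 0b10110010
p = parity_bit(original)            # parity recorded when the byte was written
corrupted = original ^ 0b00000100   # a single bit flips in memory
assert not detect_flip(original, p)
assert detect_flip(corrupted, p)
```

Note that a second flip in the same byte would restore the parity and go undetected, which is exactly why stronger codes (and end-to-end checksums) are needed.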

However, if one of these measures fails, data is inevitably corrupted. Unless applications keep verification checksums, such errors go unnoticed by the end user, which is why they are called silent. For this reason modern file systems (ZFS, BTRFS) maintain checksums and keep multiple copies of data, but until they are in widespread use it is important to keep multiple backups AND checksums for every file.
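Keeping per-file checksums is straightforward to script. The sketch below computes a SHA-256 digest per file (streamed in chunks so large files do not fill memory) and verifies it against a previously recorded value; the function names are illustrative, not from any particular tool.

```python
import hashlib

def file_sha256(path: str) -> str:
    """Stream the file in 1 MiB chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected: str) -> bool:
    """True if the file still matches the digest recorded earlier."""
    return file_sha256(path) == expected
```

In practice the digests would be stored in a sidecar file next to the data and re-verified after every copy or restore, so corruption is caught before the last good backup expires.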

Hard drives can repair media errors by internally remapping bad sectors to spare sectors. The drive keeps statistics of such events which can be extracted with a tool that issues SMART commands to the drive, such as `smartctl` on Linux. Growing values for SMART attributes 5 (Reallocated Sector Count), 187 (Reported Uncorrectable Errors), 197 (Current Pending Sector Count) and 198 (Offline Uncorrectable) are indicators of wear.
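Extracting those four attributes from a `smartctl -A` report can be automated with a small parser. The excerpt below is a hypothetical, abbreviated sample of the attribute table (real output varies by drive model and firmware and has more columns and rows); the parser simply picks the raw value from the last column of the relevant rows.

```python
# Hypothetical, abbreviated excerpt of `smartctl -A /dev/sda` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   0
"""

WEAR_IDS = {5, 187, 197, 198}

def wear_indicators(report: str) -> dict:
    """Map SMART attribute ID -> raw value for the wear-related attributes."""
    result = {}
    for line in report.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit() and int(fields[0]) in WEAR_IDS:
            result[int(fields[0])] = int(fields[-1])
    return result

print(wear_indicators(SAMPLE))
```

A monitoring job would run this periodically and alert as soon as any of the raw values grows, since a rising Reallocated or Pending Sector count usually precedes drive failure.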

Software for health checks

  • Linux/OSX: smartctl (see smartmontools), fsck
  • OSX: Disk Utility > First Aid
  • Windows: Drive: > Properties > Tools > Error Checking (or chkdsk on the command line)

New file systems like ZFS and BTRFS are best prepared to tackle silent data corruption. They internally generate block-wise and file-wise checksums and keep redundant copies of data blocks without the need for extra RAID controllers. Unlike many other file systems, which protect index structures only, ZFS and BTRFS can transparently repair errors in data blocks when reading from disk.
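Conceptually, the self-healing read path works like this sketch: every stored block has a checksum kept in the index structures, and on read each redundant replica is compared against it until a good copy is found. This is an illustration of the principle only, not ZFS or BTRFS code; real implementations also rewrite the corrupted replica with the good copy.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def read_with_repair(replicas: list, stored: bytes) -> bytes:
    """Return the first replica whose checksum matches the one on record.

    A real file system would additionally overwrite the bad replicas
    with the verified copy ("self-healing")."""
    for block in replicas:
        if checksum(block) == stored:
            return block
    raise IOError("all replicas corrupted: unrecoverable block")

good = b"payload"
assert read_with_repair([b"payl0ad", good], checksum(good)) == good
```

Only when every replica fails the comparison is the error reported to the application, which is why such file systems turn silent corruption into loud, actionable errors.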

Cloud storage internally suffers from the same types of errors because the data is stored on HDDs, SSDs or tape libraries too. Many of these errors are handled by cloud operators, who employ redundancy and automatic failover to hide the fact that some of their drives fail every day. Cloud storage offers optional checksums to verify data integrity, but in rare cases files may still be corrupted. Cloud storage also exposes other kinds of errors, such as temporary loss of access to data. Cloud operators offer service level agreements (SLAs) that define guaranteed availability as a percentage.
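Such an availability percentage translates directly into a downtime budget. The small helper below (illustrative, not part of any SLA document) converts a percentage into allowed downtime minutes over a billing period:

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget in minutes implied by an availability SLA over the period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)

# A 99.9 % SLA over a 30-day month leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))
```

Comparing that budget against one's own recovery-time requirements makes clear whether a given SLA tier is actually sufficient for the use case.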
