DATA



Checksums

Protecting the integrity of data, or at least detecting silent data corruption or malicious manipulation, is a key requirement for storage and communication systems.

Different checksumming procedures are in widespread use today. They all have in common that they are efficient to compute as data passes through a system, and that they generate a very small amount of verification data that can be efficiently stored and transported as metadata.

Verification in all examples below happens in three steps. First, a checksum is computed over the original data before storage or transport. Then, as the data is retrieved or received, a second checksum is computed. Third, both checksums are compared; any difference indicates that the data has changed in the meantime. Usually, checksums alone cannot be used to repair corruption or to identify the exact location of a change. Some form of data redundancy (identical copies or error-correcting codes) is required to achieve this.
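The three-step procedure can be illustrated with a minimal Python sketch. SHA-256 is used here purely as an example; any of the procedures discussed below could be substituted, and the data is a placeholder:

    import hashlib

    def checksum(data: bytes) -> str:
        # Compute a checksum over a block of data (SHA-256 as an example).
        return hashlib.sha256(data).hexdigest()

    # Step 1: checksum computed over the original data before storage or transport
    original = b"master file contents"
    stored_checksum = checksum(original)

    # ... data is written to disk or sent over the network ...

    # Step 2: a second checksum computed as the data is retrieved or received
    retrieved = b"master file contents"   # possibly corrupted in the meantime
    retrieved_checksum = checksum(retrieved)

    # Step 3: both checksums are compared; any difference signals a change
    if stored_checksum != retrieved_checksum:
        print("data has changed between storage and retrieval")
    else:
        print("no change detected")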

Checksums

A checksum is a small datum generated from a block of digital data. Its purpose is to detect errors introduced during communication over networks or by storage systems. Single check digits and parity bits are simple examples of checksums, appropriate for small blocks of data such as credit card or bank account numbers and memory bytes. ECC RAM and RAID-5 storage arrays, for example, use parity bits or parity bytes for data protection. Such simple checksums can detect single-bit or single-digit errors and sometimes swapped digits. Computationally more expensive checksums, such as cyclic redundancy checks (CRC), can detect complex multi-bit errors, insertion and deletion of bits and digits, and changes in the order of a bit sequence. Checksums are easy to forge, hence they cannot be used for message integrity or authenticity checks, or for data deduplication.
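The difference between detecting random errors and protecting against forgery can be sketched in Python with the CRC-32 function from the standard zlib module; the data is made up for illustration:

    import zlib

    data = b"bank account number: 1234 5678 9012"
    crc = zlib.crc32(data)

    # Flip a single bit to simulate silent corruption in storage or transit.
    corrupted = bytearray(data)
    corrupted[5] ^= 0x01

    # The CRC detects the single-bit error ...
    assert zlib.crc32(bytes(corrupted)) != crc

    # ... but it is trivial to recompute, so an attacker who modifies the data
    # can simply attach a matching checksum; CRCs do not provide authenticity.
    forged_crc = zlib.crc32(bytes(corrupted))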

Hash functions

A hash function is an algorithm that maps an arbitrary amount of data such as a short message or a large file to a small value of fixed length. Such values are called hash codes, hash sums or simply hashes.

Hash functions are related to checksums, but they are optimised for a different purpose. While checksums are designed to protect data integrity by detecting various kinds of data corruption, hash functions are designed to generate values with a low collision probability.

Good hash functions always produce the same hash value for the same input, do not produce the same hash value for two different inputs (collision resistance), and produce two widely different hash values for inputs that are almost identical (dispersion or randomness). Hash functions are generally very fast, with the fastest implementations reaching main-memory read speeds. For these reasons they are often used for file fingerprinting and for indexing data. Examples of hash functions that are widely used for various media- and IT-related tasks are MurmurHash and xxHash; a short sketch follows the list below.

  • MurmurHash
    Year: 2008
    Digest Size: 32/128 Bit
    Speed: 5.0 GBit/s
  • xxHash
    Year: 2012
    Digest Size: 32/64 Bit
    Speed: 13.8 GBit/s

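A minimal sketch of the dispersion property, assuming the third-party xxhash Python package is installed (the file names are placeholders):

    import xxhash  # third-party package, e.g. installed via pip install xxhash

    # Two inputs that differ in a single character ...
    a = xxhash.xxh64(b"master_0001.mov").hexdigest()
    b = xxhash.xxh64(b"master_0002.mov").hexdigest()

    # ... yield two widely different 64-bit hash values, which is what makes
    # such hashes useful as keys for indexing and for file fingerprinting.
    print(a, b)
    assert a != b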

Cryptographic hash functions

A cryptographic hash function is a function that maps an arbitrary amount of data to a result of fixed size (hash function) and that is assumed to be non-reversible, i.e. it is impractical to obtain the input data from the hash value (cryptographic). The output of such one-way hash functions is sometimes called a digest. Ideally such functions are fast to compute, it is infeasible to generate the original data from the digest, it is infeasible to change the original data without also changing the digest, and it is infeasible to find two different inputs that generate the same digest.
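These properties can be observed directly with Python's standard hashlib module; the inputs are made up for illustration:

    import hashlib

    # The digest has a fixed size (32 bytes for SHA-256), no matter how large
    # the input is.
    short = hashlib.sha256(b"x").digest()
    large = hashlib.sha256(b"x" * 10_000_000).digest()
    assert len(short) == len(large) == 32

    # Changing a single character of the input changes the digest completely,
    # so any modification of the original data becomes detectable.
    print(hashlib.sha256(b"invoice total: 100 EUR").hexdigest())
    print(hashlib.sha256(b"invoice total: 900 EUR").hexdigest())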

In information security such hash functions are widely used for digital signatures and message authentication. Since the algorithms are very efficient, secure hashes are also often used for file fingerprinting and secure password storage.

Examples of cryptographic hashes used for file fingerprinting are MD5, SHA1, SHA256 and SHA512. Practical collision attacks on MD5 have been demonstrated on off-the-shelf computing hardware, and theoretical attacks on SHA1 are publicly known as well. Hence it is highly risky to still use these functions for securing sensitive data; it is, however, still acceptable to use MD5 and SHA1 for random error detection. All functions in the SHA-2 family and in the new SHA-3 family are considered cryptographically secure.
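For file fingerprinting, the digest is usually computed in a streaming fashion so that large media files never have to be loaded into memory at once. A sketch using hashlib, with a placeholder file name:

    import hashlib

    def fingerprint(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
        # Compute a hex digest of a file in fixed-size chunks (1 MiB here).
        h = hashlib.new(algorithm)  # e.g. "sha256", "sha512" or "sha3_256"
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # print(fingerprint("master_0001.mov"))  # placeholder file name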

Hash-based message authentication code

HMAC is an algorithm that uses a cryptographic hash function together with a secret cryptographic key to obtain a message digest that can be used for integrity checks and for message authentication. The strength of an HMAC depends on the strength of the hash function and on the quality and size of the key. HMAC-SHA1 and HMAC-SHA256 are frequently used in the TLS protocol to secure messages sent over the public Internet. HMACs can easily be employed for verifying digital documents and media assets.
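A minimal sketch with Python's standard hmac module; the key and message are placeholders, and in practice the key would come from a key store:

    import hashlib
    import hmac

    key = b"0123456789abcdef0123456789abcdef"  # placeholder secret key
    message = b"media asset metadata record"    # placeholder message

    # Sender or archiving system: compute an HMAC-SHA256 tag over the message.
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    # Receiver or verification step: recompute the tag with the shared key and
    # compare in constant time; a modified message or wrong key fails the check.
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    assert hmac.compare_digest(tag, expected)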

Checksum Overview

  • MD5
    Year: 1991
    Digest Size: 128 Bit
    Speed: 665 MBit/s
    Security Level: insecure since 2007
  • SHA1
    Year: 1995
    Digest Size: 160 Bit
    Speed: 984 MBit/s
    Security Level: usage no longer recommended
  • SHA256
    Year: 2001
    Digest Size: 256 Bit
    Speed: 436 MBit/s
    Security Level: secure
  • SHA512
    Year: 2001
    Digest Size: 512 Bit
    Speed: 653 MBit/s
    Security Level: secure
  • SHA3-256
    Year: 2015
    Digest Size: 256 Bit
    Speed: 700 MBit/s
    Security Level: secure
  • SHA3-512
    Year: 2015
    Digest Size: 512 Bit
    Speed: 575 MBit/s
    Security Level: secure
