Simple Hash Function
Let's look at a simple hash function. Figure 1 shows a hash function intended to be used for a hash table where the keys are character strings.
The hash value is initialized to zero at the start of the function, then for each character in the string the hash value is multiplied by 33 and the Unicode code point value is added to the hash value. No overflow checking is specified, so in C# the value will just wrap around within the range of System.UInt32 if it gets big enough.
The multiplication by 33 is called the mixing step and the
function f : uint → uint defined by
f(x) = x * 33 where * is the C# multiply-without-overflow operator
is called the mixing function. The constant 33 is chosen arbitrarily.
The addition of the Unicode code point value of the character is
called the combining step and the function
g : uint × uint → uint
defined by g(x, y) = x + y where +
is the C# add-without-overflow operator is called the combining function.
Note that this particular hash function has dubious uniform distribution properties. We'll evaluate that more later.
Hash Function Structure
The preceding hash function was too simple to illustrate all of the structures commonly found in hash functions. Figure 2 shows a more complete example.
- First is the initialization step, where the internal state of the hash function is readied.
- Next, the message is split up into zero or more blocks of equal size, and the internal state of the hash function is updated for each block. In this example, the message is divided into 32-bit blocks. For each block, the combining function is applied to the prior state and the bits from the current block, and then the mixing function is called. This step is repeated for each full-sized block in the message.
- Then the final, partial block is processed, if necessary. The combining step is modified to accomodate the incomplete block.
- Finally, some final processing is done to further randomize the internal state. In this case, the mixing step is applied two more times.
If you examine the hash function from figure 1 again, you'll see that it almost follows this structure. The initialization step is there (the hash value is set to zero), and the message is processed in blocks of 16-bits. There is no post-processing step, or rather there is an "empty" post-processing step.
The main difference is that in figure 1 the mixing step is performed before the combining step for each block. And since there is no post-processing step, it's clear that the bits from the last message block are not processed by the mixing function and therefore do not affect the upper 16 bits of the hash value. If you were to use this hash function for hash table lookup you would want to be sure to use the lower-order bits of the hash value instead of the upper-order bits.
Later we'll examine the mysterious constants in this program, in the initialization step and in the mixing function, to determine whether this is a "good" hash function.
Padding the Last Block
For non-cryptographic use, it's not extremely important how the partial final block is handled. Usually it's OK just to pad the message with 0's. But this is unacceptible for cryptographic use. If all you do is pad with 0's then it is easy to find two messages with the same hash value-- just take a known message with a partial final block and append a zero to it. When the additional 0's are added for padding, the two messages look identitical. If you just pad with 1's instead, then you can add a 1 bit to an existing message.
For this and other reasons, cryptographic hash functions always pad the original message with a number of bits even if the original message was an even block size, and they always incorporate the message length into the hash calculation. The SHA-1 hash function, for example, always pads the message with a 1 bit, and the last 64 bits of the last block contain the message length, in bits. The bits between the final 1 and the length bits are set to 0. This scheme prevents an attacker from devising two different messages that are identical after the padding operation.