2.7. Hashing#
A hash function is a mathematical function that takes an input (of any size) and produces a fixed length output called a hash value.
Note
Another term for “hash value” is “digest”!
As an example, a hash function might take an entire file and generate a short, random sequence of characters. The same file will always produce the same result but even a single bit change will result in a completely different hash a property known as the avalanche effect.
2.7.1. Data Integrity#
Hash functions are primarily used for data integrity checks. That means checking that the data is the same as when it was created, since there is a risk of corruption and tampering when storing and transmitting data.
Corruption or tampering can be checked by comparing the hash of the data before and after storage or transmission. If the hash is different then the data is different.
2.7.2. Bins and Collisions#
Hash functions produce a fixed length hash value and an interpretation of this is that the hash value is the number of a bin or container. The hash function is therefore assigning each input to a particular bin.
Since a hash function produces a fixed-size output but can take infinite possible inputs, there must be different inputs that produce the same hash. When different inputs produce the same hash value we have a collision.
Collisions are inevitable if the hash function allows any length input, however a well designed hash function makes them rare.
2.7.3. Recommended Video#
Code Challenge: Rolling Hash
A rolling hash function applies a mathematical operation to pieces of the data in order.
In this exercise, write a script that calculates a polynomial rolling hash, applied to each character in a string.
Formula
The hash function produces an integer, which is calculated from:
where:
\(s\) is the input string
\(p\) is a prime number we choose
\(m\) is a prime number we choose
Details
Character Values
The characters \(s[0]\) need to be a number. To do this we can use the ASCII value of the character, which we can get using Python with the ord function e.g.
print(ord("A"))
Your script should handle all printable ASCII characters. To get the number value of only printable characters we can subtract the lowest printable character.
s = "A"
value = ord(s) - ord("!") + 1
We add 1 to the result to avoid a value of 0 which would cause problems with the hash e.g. "aaa" would be "000".
Setting p and m
Set \(p\) to 94. This number needs to be as large as the number of characters your script expects. There are 94 printable ASCII characters and the smallest prime after this is 97.
Set \(m\) to 1009. This number controls how much the hash values of each character are spread out.
Example
String to hash: egg
Hash: 939
Extension
Adjust the values of \(p\) and \(m\). As you adjust them, do you observe more collisions or fewer?
Keep in mind that \(m\) must be prime.
Solution
Solution is locked