Double-hashing and The Birthday Problem

The birthday problem (also called the birthday paradox) deals with the probability, that in a set of n randomly selected people, at least two people share the same birthday.

Though it is not technically a paradox, it is often referred to as such because the probability is counter-intuitively high.

The birthday problem is an answer to the following question:

In a set of n randomly selected people, what is the probability, that at least two people 
share the same birthday?
What is the smallest value of n where the probability is at least 50% or 99%?

Let p(n) be the probability that at least two of a group of n randomly selected people share the same birthday. By the pigeonhole principle, since there are 366 possibilities for birthdays (including February 29), it follows that when n≥367, p(n)=100%. The counterintuitive part of the answer is that for smaller n, the relationship between n and p(n) is (very) non-linear.

In fact, the thresholds to surpass 50% and 99% are quite small: 50% probability is reached with just 23 people and 99% with just 70 people.

The probability of collision

Let’s see why this is true by first deriving analytically the probability that no collision occurs for a generic case where there are m possible values, and we observe n of them. The birthday paradox above is a special case where m=365 and n=23. We want to know: what is the probability that we don’t see any collisions after n observations?

For larger numbers, we see that collisions happen very fast: in a space of 100000 possible values, we reach 50% probability after less than 400 observations. This should underscore the importance of considering birthday attacks especially in the case even where the space of possible values is very large. We can see empirically from these couple of cases that the number of observations n you need to get a probability of collision of 0.5 is approximately √m. You can prove that to yourself analytically by taking the equation we derived above for the probability of collision, setting the left hand side equal to 0.5, and rearranging to see the relationship between n and m.

Most cryptosystems use some kind of hash function to process messages. A hash function f is not injective, but is created so that collisions, or instances where x not equal y, f(x)=f(y), are hard to discover. An attacker who can find collisions can access information or messages that are not meant to be public. The birthday attack is a restatement of the birthday paradox that measures how collision-resistant a well-chosen hash function is. For instance, suppose that a hash function is chosen with a 64-bit range; that is, its image is a nonnegative integer less than 2 in power of 64 (2^64). How many values does an attacker need to compute before the probability of a collision is greater than 50%?

√m will give a 2^32, which represents a significant improvement on the sizes involved. For instance, 2^32 is roughly 4 times 10^9, while 2^64 is greater than 10^19. This vulnerability necessitates the use of a large hash range in practical applications.

Digital signature susceptibility

Digital signatures can be susceptible to birthday attacks. A message m is typically signed by first computing H(m), where H is a cryptographic hash function, and then using some secret key to sign H(m). Suppose Alice wants to trick Bob into signing a fraudulent contract.

Alice prepares a fair contract m and fraudulent one m’. She then finds a number of positions, where m can be changed without changing the meaning, such as inserting commas, empty lines, one versus two spaces after a sentence, replacing synonyms, etc.

By combining these changes she can create a huge number of variations on m, which are all fair contracts. Similarly, Alice can also make some of these changes on m’ to take it, even more, closer towards m, that is H(m) = H(m’).

Hence, Alice can now present the fair version m to Bob for signing. After Bob has signed, Alice takes the signature and attaches to it the fraudulent contract. This signature proves that Bob has signed the fraudulent contract.

Bitcoin uses double-hash, or SHA256^2, to overcome the birthday problem, which unfortunately not the case

If you double-hash a value, birthday attacks remain. If you have two values that hash to the same result, forming a birthday attack or collision, they will still collide. It is simple to show. It is assumed that we have two values X and Y. For a given hash function, we have a collision such that:

A == Hash(X) == Hash(Y).

Now, given the result that both X and Y hash to the same value, A, we can easily see that no double-hashing or squared-hashing process will help as:

Hash(A) == Hash[Hash(X)] == Hash[Hash(Y)].

In fact, we lose one bit of information for the additional hash operation. We can say that if we iterate a hash n times, it makes it n times as likely that a collision will occur. I am taking some liberty here, and the maths involved in what I’ve explained is not completely accurate, but it is true that for each time we rehash a function using the same hash function, we lose collision security for the function. In fact, if we look at how addresses in Bitcoin are created, we see that the double-hashing function increases the effect even further. In other words, the hash of the hash in the scenario is more likely to lead to a collision than a single hash or even the hash of the same hash function (a double hash).

Ferguson and Schneier (Practical Cryptography) proposed using a double-hash function as a means to defend against “length-extension” attacks with SHA-256, and they named it SHA-256d. Merkle-Damgård hash functions such as MD5 or SHA-256 are vulnerable to such an attack. If it was an issue, we could have used an HMAC, which is proven by Mihir Bellare is a PRF under the sole assumption that the compression function is a PRF. Therefore, HMAC-MD5 does not suffer from the same weaknesses that have been found in MD5.

Conclusion

To avoid such an attack the output of the hash function should be a very long sequence of bits such that the birthday attack now becomes computationally infeasible, or use double-hashing, but only of different functions.

Deep Research and Development