Estimating Hash Collision Rates

It’s inconceivable that a blog titled “Hashed Potatoes” has gone a few years without a single post about hashes. That changes now. There hasn’t been one about potatoes either, but another time, perhaps?

Suppose you have a registration system, and you want to set up a hashing scheme to generate a unique ID for each user. You want this ID to be numeric (base 10), but you don’t know how many digits you should assign to it. You do, however, have an estimate of the maximum number of users that would be registering. How can you obtain a sensible lower bound on the number of digits needed?

random oracles

Hashing functions are designed to emulate the function of random oracles. A random oracle is an idealised black-box construct that behaves according to the following rules:

A random oracle is a function that takes an object from some domain $A$ and maps it to an object in a co-domain $B$ .
Given an input it has not received before, a random oracle will randomly select an object in $B$ according to a uniform distribution to be the output.
Given an input it has received before, a random oracle will output the same result it had outputted when it previously received that input.
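
To make the three rules concrete, here is a minimal sketch of a lazily sampled random oracle in Python. The class name `RandomOracle`, the method `query`, and the choice of the integers $0$ to $n-1$ as the co-domain are all assumptions made for illustration; a real hash function does not keep a lookup table like this.

```python
import secrets

class RandomOracle:
    """Toy random oracle over the co-domain {0, ..., n-1} (illustrative only)."""

    def __init__(self, n):
        self.n = n        # size of the co-domain B
        self.table = {}   # remembers every answer already given

    def query(self, x):
        # New input: draw uniformly from B and remember the choice.
        if x not in self.table:
            self.table[x] = secrets.randbelow(self.n)
        # Seen input: repeat the earlier answer.
        return self.table[x]

oracle = RandomOracle(10**6)
first = oracle.query("alice")
print(oracle.query("alice") == first)   # same input, same output
```

Nothing in `query` stops two different inputs from drawing the same value, which is exactly the gap the rest of this post is about.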

This way, random oracles are designed to be irreversible. The only attack they should be susceptible to is enumerating every possible input until the desired output is obtained, i.e. a brute-force attack.

While the property of irreversibility is very useful, random oracles do not guarantee injectivity. What this means is that it is possible for a valid random oracle to map two different inputs to the same output. This is a hash collision.

hash collisions

It would be horrible if a hash collision were to occur in our registration system, because two different people could end up with the same ID. That would allow one to impersonate the other in our system, and access information meant only for the other. However, due to the random nature of how the output is selected, there is always a nonzero probability of a hash collision.

Let us define the following quantities for convenience:

The maximum expected number of users is $k$ , i.e. the domain size is $k=\left|A\right|$ . The domain here is the application domain, which is a subset of the hash domain.
The number of possible hash outputs is $n$ , i.e. the co-domain size is $n=\left|B\right|$ .
We want to fix the probability of at least one hash collision among all $k$ users at $p$ .

With these definitions, our question now becomes: given $k$ and $p$ , how can we determine $n$ ?

a qualitative look

Before we jump into slogging through the math, let’s first make some qualitative observations.

Notice that if $n<k$ , we are guaranteed to have a hash collision. By the pigeonhole principle, if we assign a unique slot to each of the first $n$ inputs, the next input must use an already existing hash. In mathematical terms, $\forall n<k:p=1$ .

Notice also that if $p$ increases for the same $k$ , that means that we are more lenient, so we don’t need so many hash slots, thus $n$ decreases. In mathematical terms, $\forall p\in(0,1):\frac{\partial n}{\partial p}<0$ .

If $k$ increases for the same $p$ , there are more users, and each additional user is another opportunity for a collision, thus $n$ increases. In mathematical terms, $\forall k\in\mathbb{N}:\frac{\partial n}{\partial k}>0$ .

wrangling probabilities

Let’s start by deriving an expression for $p$ in terms of $n$ and $k$ . To have no hash collision, each of the $k$ inputs must fill a different slot out of $n$ total slots.

To analyse this, let’s assign each input to a slot sequentially:

The first input can be assigned any slot, and there will be no hash collision. The probability of the first input being assigned any slot is $1$ . We can rewrite this as $\frac{n}{n}$ (you’ll see why later).
The second input can be assigned any slot except the first input’s. The probability of the second input being assigned any slot except the first input’s is $\frac{n-1}{n}$ .
The $i$ th input can be assigned any slot except those belonging to the first $i-1$ inputs. The probability of that is $\frac{n-i+1}{n}$ .

All of these conditions must hold together. Each probability above already assumes that the earlier inputs did not collide, so we can apply an AND operation (multiplication) across them to get the probability that there is no hash collision for $k$ inputs.

$1-p=\dfrac{n}{n}\dfrac{n-1}{n}\cdots\dfrac{n-k+1}{n}=\dfrac{n!}{(n-k)!\,n^k}$
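
For small inputs, the product can simply be evaluated term by term. The sketch below does that; the function name `collision_probability` is an assumption for illustration, and the birthday-problem numbers are only a familiar sanity check, not part of the registration scenario.

```python
def collision_probability(n, k):
    """Exact probability that k uniform draws from n slots contain a repeat."""
    if k > n:
        return 1.0   # pigeonhole: a collision is certain
    no_collision = 1.0
    for i in range(k):
        # The (i+1)th input must avoid the i slots already taken.
        no_collision *= (n - i) / n
    return 1.0 - no_collision

# Classic birthday problem: 23 people, 365 days, roughly 0.507.
print(collision_probability(365, 23))
```

This loop takes $k$ multiplications, which is fine for a quick check but does not help us go the other way and solve for $n$ .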

It is here that we run into our first issue. This expression is made complicated by the presence of factorials. How can we actually solve for $n$ ? As exponentials and factorials are multiplicative, let’s first take the logarithm of both sides.

$\ln\left(1-p\right)=\ln n!-\ln (n-k)!-k\ln n$

$\ln\left(1-p\right)=\sum_{i=1}^n\ln i-\sum_{j=1}^{n-k}\ln j-k\ln n$

At this point, we can convert the factorials into summations, but there is no further way to simplify this expression. We are stuck and there is no way to proceed further.

(Just kidding.)

There is indeed no way to simplify the above expression further, but let’s take a moment to analyse it. In particular, let’s look at the case where $n$ is large enough that $\sum_{i=1}^n\ln i$ is too expensive to evaluate directly. For example, if $n=10^8$ and each evaluation takes 1 ms, you’d need a little over a day ( $10^8$ ms is about 28 hours) to obtain the result. In this case, we need to rely on an approximation.

stirling’s approximation

Ultimately, there are two terms we need to approximate: $\sum_{i=1}^n\ln i$ and $\sum_{j=1}^{n-k}\ln j$ . The tool we shall use is Stirling’s approximation.

We’ll make an important observation here: $\sum_{i=1}^n\ln i$ is a Riemann sum for $\int_1^n \ln x\,dx$ . As $n$ increases, the relative error of approximating the sum by the integral shrinks. The integral evaluates to:

$\int_1^n \ln x\,dx=n\ln n-n+1$

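Let’s take this a step further and analyse the integral separately for now. Let’s apply the trapezoidal rule with unit-width strips to obtain a more accurate approximation:

$\int_1^n \ln x\,dx\approx\frac{1}{2}\ln 1+\sum_{i=2}^{n-1}\ln i+\frac{1}{2}\ln n$

Since $\ln 1=0$ , the right-hand side is the full sum $\sum_{i=1}^n\ln i$ minus $\frac{1}{2}\ln n$ . Moving that term across and substituting the value of the integral gives:

$\sum_{i=1}^n\ln i\approx\left(n+\frac{1}{2}\right)\ln n-n+1$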
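
As a quick sanity check, the sketch below compares this estimate against the sum computed directly. The function names are assumptions for illustration. The gap between the two settles near $1-\frac{1}{2}\ln 2\pi\approx 0.081$ , because the full Stirling formula has the constant $\frac{1}{2}\ln 2\pi$ where the trapezoidal estimate has $1$ .

```python
import math

def log_factorial_exact(n):
    """ln(n!) by direct summation of ln(i) for i = 1..n."""
    return sum(math.log(i) for i in range(1, n + 1))

def log_factorial_trapezoid(n):
    """The trapezoidal estimate from the text: (n + 1/2) ln n - n + 1."""
    return (n + 0.5) * math.log(n) - n + 1

for n in (10, 1000, 100000):
    exact = log_factorial_exact(n)
    approx = log_factorial_trapezoid(n)
    # The last column is the absolute gap between estimate and exact value.
    print(n, exact, approx, approx - exact)
```

The constant offset is harmless for our purposes: we only ever use the difference $\ln n!-\ln (n-k)!$ , so it cancels.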

Now that we have our approximation, let’s substitute this back into the original equation and simplify to get a very clean result:

$\ln\left(1-p\right)\approx-\left(n-k+\frac{1}{2}\right)\ln\left(1-\frac{k}{n}\right)-k$
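
To see how close this expression is to the exact one, here is a small numerical comparison. It is only a sketch: the function names and the sample values of $n$ and $k$ are assumptions chosen for illustration.

```python
import math

def log_no_collision_exact(n, k):
    """ln(1 - p) from the exact product, summed in log space."""
    # Factor i of the product is (n - i) / n = 1 - i/n, for i = 0..k-1.
    return sum(math.log1p(-i / n) for i in range(k))

def log_no_collision_stirling(n, k):
    """ln(1 - p) from the Stirling-based expression in the text."""
    return -(n - k + 0.5) * math.log1p(-k / n) - k

n, k = 10**6, 1000
print(log_no_collision_exact(n, k))
print(log_no_collision_stirling(n, k))
```

The exact version still loops over all $k$ users, whereas the Stirling-based one is a single closed-form evaluation.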

This is as far as we’re going to get with Stirling’s approximation, but there’s still no clear way to solve for $n$ yet.

taylor series

Now, we can simplify the expression even further by adding an assumption. Let’s assume that we want the hash collision probability $p$ to be very small, to the extent where $n\gg k$ , so $\frac{k}{n}\approx 0$ .

We can then apply the Taylor series $\ln\left(1-x\right)\approx-x-\frac{x^2}{2}$ with $x=\frac{k}{n}$ to approximate the right-hand side. We include two terms of the series because the most significant term cancels. The minus sign in front of the bracket absorbs the minus signs of the series:

$\ln\left(1-p\right)\approx\left(n-k+\frac{1}{2}\right)\left(\frac{k}{n}+\frac{k^2}{2n^2}\right)-k$

Expanding the right-hand side and ignoring the terms in $\frac{1}{n^2}$ :

$\ln\left(1-p\right)\approx\frac{-k(k-1)}{2n}$
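
For readers who want the intermediate algebra, one way to carry out the expansion is sketched here. It starts from the Stirling form, uses $\ln\left(1-x\right)\approx-x-\frac{x^2}{2}$ with $x=\frac{k}{n}$ , and drops the terms in $\frac{1}{n^2}$ as they appear:

$\left(n-k+\frac{1}{2}\right)\frac{k}{n}=k-\frac{k^2}{n}+\frac{k}{2n}$

$\left(n-k+\frac{1}{2}\right)\frac{k^2}{2n^2}\approx\frac{k^2}{2n}$

$\ln\left(1-p\right)\approx\left(k-\frac{k^2}{n}+\frac{k}{2n}\right)+\frac{k^2}{2n}-k=\frac{-k^2+k}{2n}=\frac{-k(k-1)}{2n}$

The leading $k$ from the first product cancels the trailing $-k$ , which is why a single term of the series would not have been enough.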

Rearranging, we can now solve for $n$ :

$n\approx\frac{-k(k-1)}{2\ln\left(1-p\right)}$

If $p$ is very small, we can simplify this one step further:

$n\approx\frac{k(k-1)}{2p}$
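
Both formulas are one-liners in code. The sketch below is illustrative only: the function names and the example figures (a million users, a one-in-a-million collision probability) are assumptions, not values from a real system.

```python
import math

def required_slots(k, p):
    """Approximate co-domain size n for k users and collision probability p."""
    # math.log1p(-p) evaluates ln(1 - p) accurately even for tiny p.
    return math.ceil(-k * (k - 1) / (2 * math.log1p(-p)))

def required_slots_small_p(k, p):
    """Further simplification for very small p: n is about k(k-1) / (2p)."""
    return math.ceil(k * (k - 1) / (2 * p))

k, p = 10**6, 1e-6
print(required_slots(k, p))          # both results are close to 5e17
print(required_slots_small_p(k, p))
```

The results go through floating point, so the last few digits are not meaningful. That is fine here, since we only care about the order of magnitude.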

observations

This is a remarkably simple result, and we can make the following observations:

$n$ scales with $k^2$ and $\frac{1}{p}$ .
$\frac{k(k-1)}{2}$ is $\binom{k}{2}$ , or the $(k-1)$ th triangular number.

Let’s get a physical intuition of these figures:

$\frac{1}{p}$ is the mean number of independent runs of the whole scheme, each with $k$ users, before a run contains a collision.
$\binom{k}{2}$ is the number of ways to choose any 2 users out of all the users.
If $n$ is large enough and $p$ is small enough, we can assume that the likelihood of 3 or more users having the same hash is negligible, when compared to just 2 users having the same hash.

Thus, under these assumptions, the probability of a collision is just the number of ways to choose any 2 users divided by the number of possible hash outputs.
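
A quick Monte Carlo experiment can back up this pair-counting picture. The sketch below is illustrative: the function name, the fixed seed, and the sample values of $n$ and $k$ are all assumptions. It draws $k$ uniform slots many times and compares the observed collision rate with the number of pairs divided by $n$ .

```python
import random
from math import comb

def simulate_collision_rate(n, k, trials, seed=0):
    """Fraction of trials in which k uniform draws from n slots collide."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(k):
            slot = rng.randrange(n)
            if slot in seen:
                hits += 1       # count this trial once, then stop it
                break
            seen.add(slot)
    return hits / trials

n, k = 10_000, 20
print(simulate_collision_rate(n, k, trials=20_000))  # empirical estimate
print(comb(k, 2) / n)                                # pair-counting estimate: 0.019
```

The empirical figure fluctuates with the seed and the number of trials, but it should land close to the pair-counting estimate as long as $n\gg k$ .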

conclusion

What does this mean for us when setting up a hashing scheme? Well, several things.

We can obtain the required number of digits as $\left\lceil\log_{10}n\right\rceil$ .
Notice that we made no assumptions on the radix in our calculations. Instead of base 10, we can use any other base, e.g. base 16 for hex, base 36 for alphanumeric, or base 62 for case-sensitive alphanumeric.
If we can include a token that splits the domain by a factor of $q$ , the required co-domain size is reduced by a factor of approximately $q^2$ . As such, it is best to include as many such tokens as possible, while being careful to avoid exposing any sensitive keys.
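
To tie the first two points together, here is a small sketch that turns a co-domain size into an ID length for any radix. The function name `digits_needed` and the example value of $n$ (roughly what a million users and $p=10^{-6}$ would call for) are assumptions for illustration. It uses integer arithmetic instead of $\log$ to avoid rounding trouble near exact powers of the base.

```python
def digits_needed(n, base=10):
    """Smallest d such that base**d >= n, i.e. d digits give at least n IDs."""
    digits, capacity = 0, 1
    while capacity < n:
        capacity *= base
        digits += 1
    return digits

n = 5 * 10**17   # roughly k(k-1)/(2p) for k = 10**6 and p = 10**-6
for base in (10, 16, 36, 62):
    print(base, digits_needed(n, base))
# prints 18, 15, 12 and 10 digits respectively
```

A larger alphabet buys a noticeably shorter ID for the same collision probability.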