Collisions of Hash or Identifier Generation

About

A collision happens when a function produces the same result while it should be unique.

Calculation

All calculation of collisions are based on the birthday problem 1).

The birthday problem tries to find the chance of two persons having:

  • the same birthday (over a set of 365 days)
  • in a classroom (over a set of N classmate)

If you translate:

  • the same birthday by a collision
  • the number of day in a year by the possible number of generated hash or id (m)
  • the number of classmate by the number of generated values (n)

you can adapt the formula.

If <math>n=m</math> , the probability is 100% (ie if you have 365 students, you are sure to have two students with the same birthday)

Applied to a collision problem, if we use the square approximation 2) (works well for probability under <math>\frac{1}{2}</math> ), we get the number of value to generate n to have the probability p(n) is: <MATH> n \approx \sqrt{2m \times p(n)} \approx \sqrt{2 \times \text{number of possible generated values} \times p(n)} </MATH> If we create a random identifier of length N in a set of k possible characters, m can be calculated with the permutation formula <MATH> m = N^k </MATH> The applied formula is then: <MATH> n \approx \sqrt{2 \times N^k \times p(n)} </MATH> or for the length of the identifier k (after a little bit of math) <MATH> k = log_N(\frac{n^2}{2 \times p(n)}) </MATH>

The below forms applied this formula and calculates the identifier length needed for a random generated identifier

Type

Hash

An hash collision happens when two different inputs produce the same hash result.

It may happen because the hash is just a number in a possible range of value. A good hash function should generate in this range values that should be generated uniformly (ie almost randomly)

You can use the previous collision probability formula for any hash.

For instance, for for the md5,

Below is a table of the odds of a hash collision when you know the number of value to hash 3). You get for sure a collision with a probability of 100% (1 on 1)

You can read the below diagram as:

  • for a hash with a length of 64-bit
  • you need 192 million input
  • to have a chance of 1 / 1000 to get a collision (four-of-kind in poker)

Id

An id collision happens when the random identifier generator produce the same id twice.

Estimation

Resolution

If the hash value is used to store data, the duplicate data can be stored at the same location and a lookup should be performed on a real unique identifier. Hash_table Collision_resolution


Powered by ComboStrap