Re: [RFC] adding support for md5

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On Fri, 18 Aug 2006, Jon Smirl wrote:

> If I have two repositories each with 100M objects in them and I merge 
> them, what is the probability of a object id collision with MD5 (128b) 
> versus SHA1 (160b)?

Assuming a uniform distribution of the hashes over our data, this is the 
birthday problem:

http://mathworld.wolfram.com/BirthdayProblem.html

(In short, given a number of days in the year, how many people do I need 
to pick randomly until at least two of them have the same birthday?)

In our case, we want to know how many objects we need in order to probably 
have a clash in 2^128 (approx. 3.4e38) and 2^160 (approx. 1.5e48) hashes, 
respectively.

Mathworld tells us that a good approximation of the probability is

p = 1 - (1-n/(2d))^(n-1)

where n is the number of objects, and d is the total number of hashes. If 
you have 100M = 1e5 objects, you probably want the probability of a clash 
below 1/1e5 = 1e-5, so let's take 1e-10. Assuming n is way lower than d, 
we can approximate

p = 1 - (1 - (n - 1 over 1) * n/(2d)) = n(n-1)/2d

and therefore (approximately)

n = sqrt(2pd)

which amounts to 2.6e14 in the case of a 128-bit hash, and 1.7e19 in the 
case of a 160-bit hash, both well beyond your 100M objects. BTW the 
addressable space of a 64-bit processor is about 1.9e19.

If you want to know the probability of a clash, you can use the same 
approximation:

For 100M objects: p = 1.5e-59 for 128-bit, and p = 3.3e-69 for 160-bit. 
This is so low as to be incomprehensible.

Remember that all these approximations are really crude, so do not rely on 
the precise numbers. But they'll give you good ballpark figures (if I did 
not make a mistake...).

Hth,
Dscho

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]