On Mon, Mar 28, 2011 at 10:28 PM, Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> wrote: > On 03/14/2011 09:25 PM, Merlin Moncure wrote: > >>> Unless the point is to guarantee uniqueness of the "long-long value"s. >> >> md5 will do that too: the main thing you lose going to hash indexing >> is ordering. > > MD5 will *probably* guarantee the uniqueness of the values. Personally I'm > not a big fan of betting on that, and tend to like to test for equality > against the hash first and then, if the hashes are equal, against the > values. > > I have several md5 collisions in one of my tables, so it's far from > impossible. In fact, it's much more likely than you'd expect thanks to the > birthday paradox. I'll agree that 128 bit key is just on the cusp of discomfort if you have a large data set and/or want to be really safe from random collision, but any larger key and it's just worry for the sake of worrying. You mentioned the birthday attack -- let's consult the chart. For MD5 (128 bits), if you are ok with 10^-15 chance of *one* collision (odds 1 in 1,000,000,000,000,000), you can safely hash up to 100,000,000,000 unique values. Of course, md5 is not perfectly distributing so the numbers are not in fact that good, but those are pretty tall odds. If you are concerned about safety, you are much better off jumping to sha-1 (or sha-256 if you're really nervous) than comparing the base string, since your chance of random bump is now well below a number of other low probability events, such as a drive bit error rate (meaning, your base string comparison is useless) or getting hit by a comet. Now, if you're worried about intentionally forged data, that's different, but it doesn't sound like that's the case here. When NIST hash competition resolves (my money is on skein), you'll likely be able to to just pick an output size that matches your requirements. merlin -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general