Linus Torvalds wrote:
NOTE NOTE NOTE! The current hash is a joke.
Insofar as hashes go, it's not that shabby for hashing filenames. Here's the test output from a small hash-comparison program I've got, which runs the test-input through a series of different hashes to compare dispersion, collisions, lookup- and insert times and other things that are interesting from a practical PoV. Fowler/Noll/Vo cheap hash Collisions: 1829, 7.91%. Depth max: 3, average: 0.09. Time spent in lookups: 167.859ms. Per entry: 0.145us Time spent inserting: 2.753ms. Per entry: 0.119us PHP Zend cheap hash Collisions: 1908, 8.25%. Depth max: 3, average: 0.09. Time spent in lookups: 171.819ms. Per entry: 0.149us Time spent inserting: 2.778ms. Per entry: 0.120us Phong Vo's linear congruential hash Collisions: 1996, 8.63%. Depth max: 3, average: 0.09. Time spent in lookups: 168.276ms. Per entry: 0.146us Time spent inserting: 2.840ms. Per entry: 0.123us Phong Vo's second linear congruential hash Collisions: 1933, 8.36%. Depth max: 3, average: 0.09. Time spent in lookups: 170.416ms. Per entry: 0.147us Time spent inserting: 2.774ms. Per entry: 0.120us Glib string hash Collisions: 1907, 8.24%. Depth max: 3, average: 0.09. Time spent in lookups: 192.420ms. Per entry: 0.166us Time spent inserting: 3.154ms. Per entry: 0.136us sdbm hash Collisions: 1899, 8.21%. Depth max: 3, average: 0.09. Time spent in lookups: 170.797ms. Per entry: 0.148us Time spent inserting: 2.724ms. Per entry: 0.118us bfd hash Collisions: 1949, 8.43%. Depth max: 3, average: 0.09. Time spent in lookups: 206.504ms. Per entry: 0.179us Time spent inserting: 3.241ms. Per entry: 0.140us Linus' GIT hash Collisions: 1946, 8.41%. Depth max: 4, average: 0.09. Time spent in lookups: 179.336ms. Per entry: 0.155us Time spent inserting: 2.781ms. Per entry: 0.120us This is with 64K buckets, which (with my implementation) means a total hash-table size of 256KiB. The test-input is a simple file-listing from the linux kernel (although it has the .git directory included too). As you can see, it loses out on mathematical correctness, as it has more collisions (but not that many). OTOH, the simplicity of the implementation makes it a viable option anyway, since the time spent in insertion (which is almost completely inside the hash-function) is on average so much shorter. For a temporary hash-table, it will work splendidly. It will probably have issues scaling to, say, 30 or 40 million input lines (fairly typical test-data for database hashes, fe). Note that real-world timings will be shorter. My test-program doesn't have the luxury of keeping hash-functions inline, so some compiler optimizations become impossible. This is on a Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz (according to /proc/cpuinfo), with a cache size of 4meg. The use of multiplication is sane though, as it means no arch will suffer greatly. The FNV hash would be better (pasted below), but I doubt anyone will ever care, and there will be larger differences between architectures with this one than the lt_git hash (well, a function's gotta have a name). /* * Fowler/Noll/Vo hash * * The basis of the hash algorithm was taken from an idea sent by email to the * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@xxxxxxxxxxxxxxxx) and * Glenn Fowler (gsf@xxxxxxxxxxxxxxxx). Landon Curt Noll (chongo@xxxxxxxx) * later improved on their algorithm. * * The magic is in the interesting relationship between the special prime * 16777619 (2^24 + 403) and 2^32 and 2^8. * * This hash produces the fewest collisions of any function that we've seen so * far, and works well on both numbers and strings. * * (Last comment from MySQL code) * */ u32 FNV1(u8 *k, u32 len) { u8 *e; u32 h; e = k + len; for (h = 0; k < e; k++) { h *= 16777619; h ^= *k; } return (h); } I could provide figures for other table-sizes too, if anyone's interested. -- Andreas Ericsson andreas.ericsson@xxxxxx OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html