Re: I'm a total push-over..

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Linus Torvalds wrote:

NOTE NOTE NOTE! The current hash is a joke.


Insofar as hashes go, it's not that shabby for hashing filenames.
Here's the test output from a small hash-comparison program I've
got, which runs the test-input through a series of different hashes
to compare dispersion, collisions, lookup- and insert times and
other things that are interesting from a practical PoV.

Fowler/Noll/Vo cheap hash
Collisions: 1829, 7.91%. Depth max: 3, average: 0.09.
Time spent in lookups: 167.859ms. Per entry: 0.145us
Time spent inserting: 2.753ms. Per entry: 0.119us

PHP Zend cheap hash
Collisions: 1908, 8.25%. Depth max: 3, average: 0.09.
Time spent in lookups: 171.819ms. Per entry: 0.149us
Time spent inserting: 2.778ms. Per entry: 0.120us

Phong Vo's linear congruential hash
Collisions: 1996, 8.63%. Depth max: 3, average: 0.09.
Time spent in lookups: 168.276ms. Per entry: 0.146us
Time spent inserting: 2.840ms. Per entry: 0.123us

Phong Vo's second linear congruential hash
Collisions: 1933, 8.36%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.416ms. Per entry: 0.147us
Time spent inserting: 2.774ms. Per entry: 0.120us

Glib string hash
Collisions: 1907, 8.24%. Depth max: 3, average: 0.09.
Time spent in lookups: 192.420ms. Per entry: 0.166us
Time spent inserting: 3.154ms. Per entry: 0.136us

sdbm hash
Collisions: 1899, 8.21%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.797ms. Per entry: 0.148us
Time spent inserting: 2.724ms. Per entry: 0.118us

bfd hash
Collisions: 1949, 8.43%. Depth max: 3, average: 0.09.
Time spent in lookups: 206.504ms. Per entry: 0.179us
Time spent inserting: 3.241ms. Per entry: 0.140us

Linus' GIT hash
Collisions: 1946, 8.41%. Depth max: 4, average: 0.09.
Time spent in lookups: 179.336ms. Per entry: 0.155us
Time spent inserting: 2.781ms. Per entry: 0.120us

This is with 64K buckets, which (with my implementation) means
a total hash-table size of 256KiB. The test-input is a simple
file-listing from the linux kernel (although it has the .git
directory included too).

As you can see, it loses out on mathematical correctness, as it
has more collisions (but not that many). OTOH, the simplicity
of the implementation makes it a viable option anyway, since
the time spent in insertion (which is almost completely inside
the hash-function) is on average so much shorter.

For a temporary hash-table, it will work splendidly. It will
probably have issues scaling to, say, 30 or 40 million input
lines (fairly typical test-data for database hashes, fe).

Note that real-world timings will be shorter. My test-program
doesn't have the luxury of keeping hash-functions inline, so
some compiler optimizations become impossible.

This is on a Intel(R) Core(TM)2 Duo CPU T7700  @ 2.40GHz
(according to /proc/cpuinfo), with a cache size of 4meg. The
use of multiplication is sane though, as it means no arch
will suffer greatly.

The FNV hash would be better (pasted below), but I doubt
anyone will ever care, and there will be larger differences
between architectures with this one than the lt_git hash (well,
a function's gotta have a name).

/*
* Fowler/Noll/Vo hash
*
* The basis of the hash algorithm was taken from an idea sent by email to the
* IEEE Posix P1003.2 mailing list from Phong Vo (kpv@xxxxxxxxxxxxxxxx) and
* Glenn Fowler (gsf@xxxxxxxxxxxxxxxx).  Landon Curt Noll (chongo@xxxxxxxx)
* later improved on their algorithm.
*
* The magic is in the interesting relationship between the special prime
* 16777619 (2^24 + 403) and 2^32 and 2^8.
*
* This hash produces the fewest collisions of any function that we've seen so
* far, and works well on both numbers and strings.
*
* (Last comment from MySQL code)
*
*/
u32 FNV1(u8 *k, u32 len)
{
	u8 *e;
	u32 h;

	e = k + len;
	for (h = 0; k < e; k++) {
		h *= 16777619;
		h ^= *k;
	}

	return (h);
}


I could provide figures for other table-sizes too, if anyone's
interested.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux