On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote:
Marko Kreen wrote:
On 1/23/08, Andreas Ericsson <ae@xxxxxx> wrote:
Dmitry Potapov wrote:
On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote:
The FNV hash would be better (pasted below), but I doubt
anyone will ever care, and there will be larger differences
between architectures with this one than with the lt_git hash
(well, a function's gotta have a name).
Actually, Bob Jenkins' lookup3 hash is twice as fast as FNV in
my tests, and it is also much less likely to produce collisions.
From http://burtleburtle.net/bob/hash/doobs.html
---
FNV Hash
I need to fill this in. Search the web for FNV hash. It's faster
than my hash on Intel (because Intel has fast multiplication),
but slower on most other platforms. Preliminary tests suggested
it has decent distributions.
I suspect that this paragraph was about comparison with lookup2
It might be. It's from the link Dmitry posted in his reply to my
original message. (something/something/doobs.html).
(not lookup3), because lookup3 easily beat all the "simple" hashes
By how much? FNV beat Linus' hash by 0.01 microseconds / insertion,
and 0.1 microseconds / lookup. We're talking about a case here where
there will never be more lookups than insertions (unless I'm much
mistaken).
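For reference, here is a minimal FNV-1a sketch in C; the 32-bit
offset basis and prime are the published FNV constants, and the
function name is only illustrative:

#include <stddef.h>
#include <stdint.h>

/* FNV-1a, 32-bit: xor in each input byte, then multiply by the
 * FNV prime.  The constants are the standard 32-bit FNV values. */
static uint32_t fnv1a_32(const void *data, size_t len)
{
        const unsigned char *p = data;
        uint32_t hash = 2166136261u;    /* FNV offset basis */

        while (len--) {
                hash ^= *p++;
                hash *= 16777619u;      /* FNV prime */
        }
        return hash;
}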
If you don't mind a few percent speed penalty compared to Jenkins's
own optimized version, you can use my simplified version:
http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD
I don't, but I don't care that deeply either. On the one hand,
it would be nifty to have an excellent hash-function in git.
On the other hand, it would look stupid to use something that's
quite clearly overkill.
It always works with "native" endianness, unlike Jenkins's
fixed-endian hashlittle() / hashbig(). That may or may not matter
if you plan to write the values to disk.
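To illustrate the difference (a sketch for this thread, not my code
nor Jenkins's): a native-endian 4-byte load is fast but its result
depends on the host, while an explicit little-endian decode gives
the same value, and hence the same hash, everywhere:

#include <stdint.h>
#include <string.h>

/* Native load: fast, but the value it produces (and any hash fed
 * with it) differs between big- and little-endian hosts. */
static uint32_t load_native(const unsigned char *p)
{
        uint32_t v;
        memcpy(&v, p, sizeof(v));       /* also avoids unaligned traps */
        return v;
}

/* Explicit little-endian decode: same value on every host, so a hash
 * built on it can safely be written to disk and compared later. */
static uint32_t load_le(const unsigned char *p)
{
        return (uint32_t)p[0]
             | (uint32_t)p[1] << 8
             | (uint32_t)p[2] << 16
             | (uint32_t)p[3] << 24;
}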
Speed-wise it may be 10-30% slower in the worst case (in my case a
sparc-classic with unaligned data), but on x86, with a lucky gcc
version and maybe also the memcpy() hack seen in system.h, it tends
to be ~10% faster, especially as it always does 4-byte reads in the
main loop.
It would have to be a significant improvement in wall-clock time
on a test-case of hashing 30k strings to warrant going from 6 to 80
lines of code, imo. I still believe the original dumb hash Linus
wrote is "good enough".
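To be concrete, the following is a rough sketch of the kind of
wall-clock test I mean; fnv1a_32() is the FNV-1a routine sketched
earlier (repeated here so it compiles on its own), and the string
set is made up:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* fnv1a_32() from the earlier sketch, repeated for completeness. */
static uint32_t fnv1a_32(const void *data, size_t len)
{
        const unsigned char *p = data;
        uint32_t h = 2166136261u;

        while (len--) {
                h ^= *p++;
                h *= 16777619u;
        }
        return h;
}

/* Hypothetical harness: hash 30k synthetic strings and report the
 * average wall-clock time per call. */
int main(void)
{
        enum { NSTR = 30000, LEN = 32 };
        static char strs[NSTR][LEN];
        struct timespec t0, t1;
        volatile uint32_t sink = 0;     /* keep calls from being optimized out */
        int i;

        for (i = 0; i < NSTR; i++)
                snprintf(strs[i], LEN, "path/to/file-%d.c", i);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NSTR; i++)
                sink ^= fnv1a_32(strs[i], strlen(strs[i]));
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.1f ns/hash (checksum %08x)\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / NSTR,
               (unsigned)sink);
        return 0;
}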
On a side-note, it was very interesting reading, and I shall have
to add jenkins3_mkreen() to my test-suite (although the "keep
copyright note" license thing bugs me a bit).
Would you, for completeness' sake, please add the Tcl and STL hashes
to your test suite? The numbers are quite interesting. Is your test
suite available somewhere, so we can test with our own data and
hardware as well? Both the Tcl hash and the STL string hash (from
SGI, probably HP days, and still the current default with g++) are
extremely simple (excluding the loop constructs):
Tcl: h += (h<<3) + c; // essentially *9+c (but works better on non-late-Intels)
STL: h = h * 5 + c; // worse than above for most of my data
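Spelled out as complete C functions (a rendering of the one-liners
above, not code lifted from Tcl or the SGI STL):

#include <stdint.h>

/* Tcl-style string hash: h = h*9 + c, written as shift-and-add. */
static uint32_t tcl_hash(const char *s)
{
        uint32_t h = 0;
        unsigned char c;

        while ((c = (unsigned char)*s++) != 0)
                h += (h << 3) + c;      /* h*8 + h + c == h*9 + c */
        return h;
}

/* SGI/HP STL-style string hash: h = h*5 + c. */
static uint32_t stl_hash(const char *s)
{
        uint32_t h = 0;
        unsigned char c;

        while ((c = (unsigned char)*s++) != 0)
                h = h * 5 + c;
        return h;
}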
Thanks,
__Luke