Luke Lu wrote:
On Jan 23, 2008, at 6:39 AM, Andreas Ericsson wrote:
Marko Kreen wrote:
On 1/23/08, Andreas Ericsson <ae@xxxxxx> wrote:
Dmitry Potapov wrote:
On Wed, Jan 23, 2008 at 09:32:54AM +0100, Andreas Ericsson wrote:
The FNV hash would be better (pasted below), but I doubt
anyone will ever care, and there will be larger differences
between architectures with this one than with the lt_git hash
(well, a function's gotta have a name).
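For reference, a sketch of one common 32-bit FNV variant (FNV-1a,
with the published constants; the variant pasted originally may
have differed):

    #include <stdint.h>
    #include <stddef.h>

    /* 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime. */
    static uint32_t fnv1a_32(const void *data, size_t len)
    {
            const unsigned char *p = data;
            uint32_t h = 2166136261u;  /* FNV offset basis */

            while (len--) {
                    h ^= *p++;
                    h *= 16777619u;    /* FNV prime */
            }
            return h;
    }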
Actually, Bob Jenkins' lookup3 hash is twice as fast in my tests
as FNV, and it is also much less likely to produce collisions.
From http://burtleburtle.net/bob/hash/doobs.html
---
FNV Hash
I need to fill this in. Search the web for FNV hash. It's faster
than my hash on Intel (because Intel has fast multiplication), but
slower on most other platforms. Preliminary tests suggested it has
decent distributions.
I suspect that this paragraph was about a comparison with lookup2
It might be. It's from the link Dmitry posted in his reply to my original
message. (something/something/doobs.html).
(not lookup3), because lookup3 easily beat all the "simple" hashes
By how much? FNV beat Linus' hash by 0.01 microseconds / insertion,
and 0.1 microseconds / lookup. We're talking about a case here where
there will never be more lookups than insertions (unless I'm much
mistaken).
If you don't mind a few percent speed penalty compared to Jenkins'
own optimized version, you can use my simplified version:
http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c34573189f3e083;hb=HEAD
I don't, but I don't care that deeply either. On the one hand,
it would be nifty to have an excellent hash function in git.
On the other hand, it would look silly to ship something that's
quite clearly overkill.
It always works with "native" endianness, unlike Jenkins'
fixed-endian hashlittle() / hashbig(). That may or may not matter,
depending on whether you plan to write the values to disk.
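To make the difference concrete, here is an illustrative sketch of
a native word load next to a fixed little-endian one:

    #include <stdint.h>
    #include <string.h>

    /* Native load: the resulting word (and hence the hash) depends
     * on the host's byte order, so x86 and sparc disagree. */
    static uint32_t load_native(const unsigned char *p)
    {
            uint32_t w;
            memcpy(&w, p, sizeof(w));
            return w;
    }

    /* Fixed little-endian load: same value on every host, which is
     * what you need if the hash values are ever written to disk. */
    static uint32_t load_le(const unsigned char *p)
    {
            return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                   ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }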
Speed-wise it may be 10-30% slower in the worst case (in my case,
sparc-classic with unaligned data), but on x86, with a lucky gcc
version and perhaps the memcpy() hack seen in system.h, it tends
to be ~10% faster, especially as it always does 4-byte reads in
the main loop.
It would have to be a significant improvement in wall-clock time
on a test-case of hashing 30k strings to warrant going from 6 to 80
lines of code, imo. I still believe the original dumb hash Linus
wrote is "good enough".
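A minimal timing harness for that kind of comparison might look like
the sketch below (the hash_fn type, the string set, and the counts
are illustrative placeholders):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Illustrative signature shared by the hashes under test. */
    typedef uint32_t (*hash_fn)(const void *data, size_t len);

    /* Wall-clock seconds for hashing the whole string set once;
     * nstrings would be ~30000 in the test-case discussed above. */
    static double bench(hash_fn hash, char **strings, size_t nstrings)
    {
            struct timespec t0, t1;
            uint32_t sink = 0;  /* keeps the calls from being optimized away */
            size_t i;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < nstrings; i++)
                    sink ^= hash(strings[i], strlen(strings[i]));
            clock_gettime(CLOCK_MONOTONIC, &t1);

            fprintf(stderr, "sink=%08x\n", sink);  /* forces 'sink' to stay live */
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }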
On a side-note, it was very interesting reading, and I shall have
to add jenkins3_mkreen() to my test-suite (although the "keep
copyright note" license thing bugs me a bit).
Would you, for completeness' sake, please add Tcl and STL hashes to your
test suite?
I could do that. Or I just publish the entire ugly thing and let someone
else add them ;-)
The numbers are quite interesting. Is your test suite
available somewhere, so we can test with our own data and hardware
as well?
Not yet, no. I usually munge it up quite a lot when I want to test hashes
for a specific input, so it's not what anyone would call "pretty".
Both the Tcl hash and the STL string hash (from SGI, probably HP
days; still the current default with g++) are extremely simple
(excluding the loop constructs):
Tcl: h += (h<<3) + c; // essentially *9 + c (but works better on
non-recent Intel CPUs)
STL: h = h * 5 + c; // worse than above for most of my data
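Spelled out as complete functions (assuming a zero initial value and
unsigned bytes), the two loops would be roughly:

    #include <stdint.h>

    /* Tcl string hash: h = h*9 + c, written as shift-and-add. */
    static uint32_t tcl_hash(const char *s)
    {
            uint32_t h = 0;

            while (*s)
                    h += (h << 3) + (unsigned char)*s++;
            return h;
    }

    /* SGI STL string hash: h = h*5 + c. */
    static uint32_t stl_hash(const char *s)
    {
            uint32_t h = 0;

            while (*s)
                    h = h * 5 + (unsigned char)*s++;
            return h;
    }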
They sure do look simple enough. As for loop constructs, I've tried to
use the same looping mechanics for everything, so as to let the algorithm
be the only difference. Otherwise it gets tricky to do comparisons. The
exceptions are, of course, hashes relying on Duff's device or similar
alignment trickery.
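One way to keep the looping mechanics identical is to reduce each
per-byte hash to a step function and share a single driver loop; a
sketch with illustrative names:

    #include <stdint.h>
    #include <stddef.h>

    /* Each per-byte hash reduced to its step function. */
    typedef uint32_t (*step_fn)(uint32_t h, unsigned char c);

    static uint32_t tcl_step(uint32_t h, unsigned char c)
    {
            return h + (h << 3) + c;
    }

    static uint32_t fnv1a_step(uint32_t h, unsigned char c)
    {
            return (h ^ c) * 16777619u;
    }

    /* One shared driver loop, so the loop overhead is identical and
     * only the algorithm differs. Block-oriented hashes (lookup3,
     * Duff's-device variants) don't fit this mold and keep their
     * own loops. */
    static uint32_t run_hash(step_fn step, uint32_t init,
                             const unsigned char *p, size_t len)
    {
            uint32_t h = init;

            while (len--)
                    h = step(h, *p++);
            return h;
    }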
--
Andreas Ericsson andreas.ericsson@xxxxxx
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231