Re: I'm a total push-over..

"Marko Kreen" <markokr@xxxxxxxxx> · Fri, 25 Jan 2008 22:52:06 +0200

On 1/24/08, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
> source code projects in UTF-8, without *ever* doing any format changes at
> all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie
> we don't care about byte-order or size of integers or anything like that),
> but that's the common case for most hashes that aren't used for BTree
> lookup on disk or something like that.
>
> Here, let me show you:
>
>         unsigned int name_hash(const char *name, int size)

Well, although this is very clever approach, I suggest against it.
You'll end up with complex code that gives out substandard results.

I think its better to have separate case-folding function (or several),
that copies string to temp buffer and then run proper optimized hash
function on that buffer.

That way you can use already tested building blocks and can optimize
both sides separately.  Eg. the folding-only function can  aswell be
optimized to load 4 or 8-byte at-a-time.  This also isolates hashing
from exact details how folding happens to access the input string which
seem to be the weak point in your approach.  (In both collision and
complexity sense.)

Such temp buffer happens to fits my lookup3_memcpy also better (heh).
Its weak point is that on platforms that do not allow unaligned access,
it degenerates to byte-by-byte loading.  But if know you always
have aligned buffer, you can notify gcc to do 4-byte fetch there too.
It should be as simple as tagging data pointer as uint32_t *.

Anyway, now you dont need to worry about folding when picking hash.

> Basically, it's almost always a stupid thing to actually normalize a whole
> string. You do those things character-by-character, and only lazily when
> you actually need to!

If your input strings are over kilobyte on average then I'd
agree with you, but if you process 20-30 bytes on average,
is the additional complexity worth it?

-- 
marko
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html