On Thu, 24 Jan 2008, Dmitry Potapov wrote:
>
> I don't think you can do any meaningful unicode-juggling without
> converting symbols to UCS-4, and after that it makes much more sense
> to operate with uint32 than bytes. So, Jenkins' hash is still relevant,
> just because it does not operate on single bytes, but using uint32.

No, no, no, NO!

Egads! Why do people constantly do these totally idiotic things for
Unicode?

You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
source code projects in UTF-8, without *ever* doing any format changes at
all.

Admittedly, it's a lot easier if the hash is a pure in-memory one (ie we
don't care about byte-order or size of integers or anything like that),
but that's the common case for most hashes that aren't used for BTree
lookup on disk or something like that.

Here, let me show you:

	unsigned int name_hash(const char *name, int size)
	{
		unsigned int hash = HASH_INIT;

		do {
			unsigned char c;

			if (size >= sizeof(long)) {
				unsigned long val = get_unaligned_long(name);
				if (!(val & 0x8080808080808080)) {
					/* Make it equivalent in case */
					val &= ~0x2020202020202020;
					hash = hash_long(hash, val);
					name += sizeof(long);
					size -= sizeof(long);
					continue;
				}
			}
			c = *name;
			if (!(c & 0x80)) {
				hash = hash_long(hash, c & ~0x20);
				name++;
				size--;
				continue;
			}

			/* This is the unusual and slowish case */
			hash = hash_utf8_char(hash, c, &name, &size);
		} while (size);
		return hash;
	}

and then the only point you ever do that actual UTF8->codepoint conversion
is for that "high bit set" case.

A few things to note on the above:

 - the hash obviously has "odd" characteristics. We're not necessarily
   hashing characters at a time at all, and the alignment of characters
   with high bits *within*the*string* will make a difference to how we
   hash them.

   But it's also important that the "get_unaligned_long()" means that the
   alignment of the string itself doesn't matter, so it's purely a
   "chunking within the string" thing, and the alignment of the string
   itself won't affect the hash value. (One possible shape for these
   helpers is sketched at the end of this mail.)

 - I'm not writing out hash_utf8_char(), because it's certainly not
   totally trivial, but it's not *really* complex either. The
   nontriviality isn't so much the decoding into a codepoint (which is
   pretty simple), but the fact that when you have the codepoint you
   should then decompose it and turn it into upper case, which is
   generally two table lookups. Then, you just do

	for_each_decomposed_uppercased_codepoint(c)
		hash = hash_long(hash, c);

   and one thing to note is that for the hashing, the decomposition and
   uppercasing don't even have to be "exact" (the same way I didn't do an
   "exact" upper-casing for US-ASCII, just a "good enough" one!). There's
   a rough sketch of this appended at the end of this mail too.

Similarly, when you actually do a unicode *compare* function, you should
never *ever* actually convert to any unicode codepoints or do any
expensive decomposition AT ALL by default! What you do is to compare
things byte-by-byte, and only convert to unicode/decompose if there are
any differences, and only for those parts of the sequence that differ!
(Again, there's a sketch of this at the end.)

So if you have two UTF-8 strings (even if they have "complex" characters,
ie with the high bit set), the *common* case is that you'll match them
byte for byte, and they'll match without any Unicode conversion needed at
all! This is common because:

 - normally, even if you don't ever normalize, people tend to input things
   in a *similar* manner (again, OS X is the odd man out), so even
   non-normalized strings are often non-normalized the same way!
 - we're only going to compare things that have hashed to the same thing
   anyway, so the common case is that it's the same string, and most
   likely had the same source. And if it's a collision, it's often
   totally different.

And even if it's different only in case, the *common* case is going to be
(for source code trees, at least) that the differing character is a
US-ASCII letter, and the case-comparison will again be done without any
Unicode knowledge at all!

This is why normalization of strings beforehand is generally so utterly
stupid. It doesn't buy you anything. It complicates things a lot (you
don't want to normalize in-place, so you have memory management issues),
and it actually SLOWS THINGS DOWN.

It's much better to do UTF-8 comparisons and hashing char-by-char. At
least if you know that the common case is not going to be the complex
part of the character set (which is undoubtedly true for source code
filenames).

Normalizing things ahead of time *only* makes sense if:

 - you expect complex characters to be a big part of your load

   and

 - you're going to do literally *thousands* of comparisons against the
   *same* strings over and over (so that the up-front cost of
   normalization gets amortized over all those comparisons)

For example, for a filesystem, it's true that you're going to compare
against the *target* (on-disk) multiple times, but that doesn't actually
mean that normalizing it makes any sense - because the data you're going
to compare it against comes from user space and isn't guaranteed to be
normalized, so you still cannot do a simple memcmp() without the expense
of normalizing that.

And since you're going to hash the filenames anyway, you will not have
"thousands of comparisons" per source lookup, you'll generally only have
a couple, so now your normalization actually costs you *more* than doing
the above on-the-fly comparison!

See?

Basically, it's almost always a stupid thing to actually normalize a
whole string. You do those things character-by-character, and only lazily
when you actually need to!

		Linus
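
PS. The name_hash() sketch above leans on a few helpers it never defines.
Here is one plausible (and totally untested) shape for them - HASH_INIT,
get_unaligned_long() and hash_long() are just the placeholder names used
in the sketch, not any existing API, and the multiplier below is an
arbitrary odd constant, nothing magic. Note also that the 0x8080... and
0x2020... constants in name_hash() assume a 64-bit "long".

	#include <string.h>

	#define HASH_INIT 0x12345679u		/* arbitrary seed; nothing magic */

	/* Load sizeof(long) bytes from an arbitrarily aligned address.
	 * memcpy() keeps it portable; compilers turn it into a single load. */
	static inline unsigned long get_unaligned_long(const char *p)
	{
		unsigned long val;
		memcpy(&val, p, sizeof(val));
		return val;
	}

	/* Fold one word into the hash. Any decent mixing step works; the
	 * multiplier is just a large odd constant. */
	static inline unsigned int hash_long(unsigned int hash, unsigned long val)
	{
		hash ^= (unsigned int)val;
		hash ^= (unsigned int)(val >> (sizeof(long) * 4));	/* high half, if 64-bit */
		return hash * 0x9e370001u;
	}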
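
A rough, equally untested sketch of the hash_utf8_char() slow path
described above. decompose_and_upcase() and hash_codepoint() are made-up
stand-ins here: the former for the two table lookups (a real version
would consult actual Unicode tables - and as noted, it doesn't have to be
exact for hashing), the latter for whatever mixing step the rest of the
hash uses.

	/* Write the (roughly) uppercased decomposition of 'cp' into 'out',
	 * return how many codepoints it produced. Stubbed to the identity
	 * mapping, which is "good enough" for a sketch. */
	static int decompose_and_upcase(unsigned int cp, unsigned int out[4])
	{
		out[0] = cp;		/* real tables would go here */
		return 1;
	}

	static unsigned int hash_codepoint(unsigned int hash, unsigned int cp)
	{
		return (hash + cp) * 0x9e370001u;
	}

	static unsigned int hash_utf8_char(unsigned int hash, unsigned char c,
					   const char **name, int *size)
	{
		unsigned int cp, out[4];
		int i, n, extra;

		/* The lead byte tells us how many continuation bytes follow */
		if ((c & 0xe0) == 0xc0) {
			cp = c & 0x1f; extra = 1;
		} else if ((c & 0xf0) == 0xe0) {
			cp = c & 0x0f; extra = 2;
		} else if ((c & 0xf8) == 0xf0) {
			cp = c & 0x07; extra = 3;
		} else {
			cp = c; extra = 0;	/* invalid lead byte: just hash it */
		}
		(*name)++;
		(*size)--;

		while (extra-- > 0 && *size > 0) {
			unsigned char cc = (unsigned char)**name;
			if ((cc & 0xc0) != 0x80)
				break;		/* truncated sequence: stop early */
			cp = (cp << 6) | (cc & 0x3f);
			(*name)++;
			(*size)--;
		}

		n = decompose_and_upcase(cp, out);
		for (i = 0; i < n; i++)
			hash = hash_codepoint(hash, out[i]);
		return hash;
	}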
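
And the lazy compare could look something like the sketch below
(names_equal() and same_utf8_char() are made up for illustration). The
point is that the Unicode machinery only ever runs on the characters that
actually differ, and for source trees even those are usually just
US-ASCII letters differing in case.

	/* Slow path stub: a real version would decode one character from
	 * each side, decompose + case-fold both, compare the resulting
	 * codepoints, and advance both cursors (the two sides may consume
	 * different numbers of bytes, which is why it gets both of them).
	 * Stubbed to "not equal" here. */
	static int same_utf8_char(const char **a, int *asize,
				  const char **b, int *bsize)
	{
		(void)a; (void)asize; (void)b; (void)bsize;
		return 0;
	}

	static int names_equal(const char *a, int asize, const char *b, int bsize)
	{
		while (asize && bsize) {
			unsigned char ca = *a, cb = *b;

			if (ca == cb) {		/* byte-identical: the common case */
				a++; asize--;
				b++; bsize--;
				continue;
			}
			if (!((ca | cb) & 0x80)) {
				/* Both US-ASCII: equal only if it's the same
				 * letter differing in the 0x20 "case" bit */
				if ((ca ^ cb) != 0x20 ||
				    (ca | 0x20) < 'a' || (ca | 0x20) > 'z')
					return 0;
				a++; asize--;
				b++; bsize--;
				continue;
			}
			/* The rare case: Unicode-compare just these characters */
			if (!same_utf8_char(&a, &asize, &b, &bsize))
				return 0;
		}
		return !asize && !bsize;	/* both fully consumed */
	}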