Re: I'm a total push-over..

"Marko Kreen" <markokr@xxxxxxxxx> · Sat, 26 Jan 2008 14:16:29 +0200

On 1/26/08, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, 25 Jan 2008, Marko Kreen wrote:
> > Well, although this is very clever approach, I suggest against it.
> > You'll end up with complex code that gives out substandard results.
>
> Actually, *your* operation is the one that gives substandard results.
>
> > I think its better to have separate case-folding function (or several),
> > that copies string to temp buffer and then run proper optimized hash
> > function on that buffer.
>
> I'm sorry, but you just cannot do that efficiently and portably.
>
> I can write a hash function that reliably does 8 bytes at a time for the
> common case on a 64-bit architecture, exactly because it's easy to do
> "test high bits in parallel" with a simple bitwise 'and', and we can do
> the same with "approximate lower-to-uppercase 8 bytes at a time" for a
> hash by just clearing bit 5.
>
> In contrast, trying to do the same thing in half-way portable C, but being
> limited to having to get the case-folding *exactly* right (which you need
> for the comparison function) is much much harder. It's basically
> impossible in portable C (it's doable with architecture-specific features,
> ie vector extensions that have per-byte compares etc).
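
For reference, the word-at-a-time trick described above could be sketched roughly as below. This is only my illustration, not code from git: the names has_high_bit, fold_approx and hash_icase_ascii are made up, and the mixing step is an arbitrary FNV-style multiply picked to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if any of the 8 bytes has its high bit set. */
static int has_high_bit(uint64_t w)
{
    return (w & UINT64_C(0x8080808080808080)) != 0;
}

/*
 * Approximate case fold: clear bit 5 (0x20) in every byte at once.
 * Maps 'a'..'z' onto 'A'..'Z', but also mangles other bytes
 * (e.g. '1' 0x31 becomes 0x11). Acceptable for a hash, where extra
 * collisions are harmless, but not for an exact comparison.
 */
static uint64_t fold_approx(uint64_t w)
{
    return w & ~UINT64_C(0x2020202020202020);
}

/*
 * Hypothetical case-insensitive hash. Sets *ascii_only to 0 when a
 * byte >= 0x80 is seen, so a caller could fall back to a slower path.
 * The value depends on byte order, which is fine for an in-memory table.
 */
static uint32_t hash_icase_ascii(const char *s, size_t len, int *ascii_only)
{
    uint64_t h = UINT64_C(0xcbf29ce484222325);   /* arbitrary seed */

    *ascii_only = 1;
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);                        /* safe unaligned load */
        if (has_high_bit(w))
            *ascii_only = 0;
        h = (h ^ fold_approx(w)) * UINT64_C(0x100000001b3);
        s += 8;
        len -= 8;
    }
    while (len--) {                              /* tail, one byte at a time */
        unsigned char c = (unsigned char)*s++;
        if (c & 0x80)
            *ascii_only = 0;
        h = (h ^ (uint64_t)(c & ~0x20u)) * UINT64_C(0x100000001b3);
    }
    return (uint32_t)(h ^ (h >> 32));
}
```

With this sketch "Hello.txt" and "HELLO.TXT" hash to the same value, since both the 8-byte chunk and the tail byte go through the same bit-5 masking.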

Here you misunderstood me; I was proposing the following:

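int hash_folded(const char *str, int len)
{
   char buf[512];                /* folded copy, never leaves this function */
   if (len > (int)sizeof(buf))   /* sketch: hash only a prefix of long names */
      len = (int)sizeof(buf);
   do_folding(buf, str, len);    /* case-fold str into buf */
   return do_hash(buf, len);     /* any optimized hash over the folded bytes */
}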

That is, the folded string should stay internal to the hash function.

The only difference from combined folding+hashing would be that
you can code each part separately.
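
To make the split concrete, here is a rough self-contained sketch. The bodies of do_folding and do_hash are my assumptions for illustration only: the fold is plain ASCII lowercasing and the hash is 32-bit FNV-1a standing in for "a proper optimized hash". I also return uint32_t instead of int here, and names longer than the buffer are hashed by prefix, which is still a consistent hash.

```c
#include <stdint.h>

/* Assumed folding step: ASCII-only lowercase, written into dst. */
static void do_folding(char *dst, const char *src, int len)
{
    for (int i = 0; i < len; i++) {
        char c = src[i];
        if (c >= 'A' && c <= 'Z')
            c = (char)(c - 'A' + 'a');
        dst[i] = c;
    }
}

/* Stand-in hash: 32-bit FNV-1a over the already folded bytes. */
static uint32_t do_hash(const char *buf, int len)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (int i = 0; i < len; i++) {
        h ^= (unsigned char)buf[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* The folded string lives only in buf and is never handed to the caller. */
static uint32_t hash_folded(const char *str, int len)
{
    char buf[512];

    if (len > (int)sizeof(buf))          /* hash only a prefix of long names */
        len = (int)sizeof(buf);
    do_folding(buf, str, len);
    return do_hash(buf, len);
}
```

Either half can be swapped out (a different fold, a faster hash) without touching the other, which is the whole point of keeping them separate.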

> And hashing is performance-critical, much more so than the compares (ie
> you're likely to have to hash tens of thousands of files, while you will
> only compare a couple). So it really is worth optimizing for.
>
> And the thing is, "performance" isn't a secondary feature. It's also not
> something you can add later by optimizing.
>
> It's also a mindset issue. Quite frankly, people who do this by "convert
> to some folded/normalized form, then do the operation" will generally make
> much more fundamental mistakes. Once you get into the mindset of "let's
> pass a corrupted string around", you are in trouble. You start thinking
> that the corrupted string isn't really "corrupt", it's in an "optimized
> format".
>
> And it's all downhill from there. Don't do it.

Again, you seem to have HFS+ behaviour in mind, but that is
not what I suggested.  Probably my mistake, sorry.

-- 
marko