On 1/27/08, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > On Sat, 26 Jan 2008, Marko Kreen wrote: > > > > Here you misunderstood me, I was proposing following: > > > > int hash_folded(const char *str, int len) > > { > > char buf[512]; > > do_folding(buf, str, len); > > return do_hash(buf, len); > > } > > > > That is - the folded string should stay internal to hash function. > > If it's internal, it's much better, but you still missed the performance > angle. > > The fact is, hashing can take shortcuts that folding cannot do! > > Case folding, by definition, has to be "exact" (since the whole point is > what you're going to use the same folding function to do the compare, so > if you play games with folding, the compares will be wrong). > > But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if > they were different cases of the same character, if that gives you a > faster hash function. Especially as those charactes are rather rare in > filenames. > > So if you do hashing as a function of its own, you can simply do a better > job at it. > > I do agree that the functions that create a folded set of characters from > a _complex_ UTF-8 character should be shared between folding and hashing, > since that code is too complex and there are no simple shortcuts for doing > a faster hash that still retains all the properties we want. Well, you can always have fold_quick_and_dirty() function that is used only internally in hash_folded() function, which can: - fold with simple |= 0x20202020.. - write out full uint32/64, no need to make result proper string - zero-fill at the end, so hash function does not need to check for partial block, which is pretty expensive part of hashing. The win would be: - more modularized code - can use faster/any hash - hash function can be certain to work on aligned data (win on non-x86) The minus: - some memory i/o overhead which may or may not matter - the parts would not be fully generic, but special to hashing -- marko PS. Typo in last mail - "inner loop should be reversible - that means that details from beginning of data should shift out of horizon." That obviously means "data should _not_ shift out of horizon. btw, "reversible" for integer hashes means that there is 1:1 mapping between input and output - no collisions. Thus no info loss. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html