On 1/26/08, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, 25 Jan 2008, Marko Kreen wrote:
> > Well, although this is very clever approach, I suggest against it.
> > You'll end up with complex code that gives out substandard results.
>
> Actually, *your* operation is the one that gives substandard results.
>
> > I think its better to have separate case-folding function (or several),
> > that copies string to temp buffer and then run proper optimized hash
> > function on that buffer.
>
> I'm sorry, but you just cannot do that efficiently and portably.
>
> I can write a hash function that reliably does 8 bytes at a time for the
> common case on a 64-bit architecture, exactly because it's easy to do
> "test high bits in parallel" with a simple bitwise 'and', and we can do
> the same with "approximate lower-to-uppercase 8 bytes at a time" for a
> hash by just clearing bit 5.
>
> In contrast, trying to do the same thing in half-way portable C, but being
> limited to having to get the case-folding *exactly* right (which you need
> for the comparison function) is much much harder. It's basically
> impossible in portable C (it's doable with architecture-specific features,
> ie vector extensions that have per-byte compares etc).

Here you misunderstood me; I was proposing the following:

    int hash_folded(const char *str, int len)
    {
        char buf[512];
        do_folding(buf, str, len);
        return do_hash(buf, len);
    }

That is, the folded string should stay internal to the hash function.
The only difference from combined folding+hashing would be that you
can code each part separately.

> And hashing is performance-critical, much more so than the compares (ie
> you're likely to have to hash tens of thousands of files, while you will
> only compare a couple). So it really is worth optimizing for.
>
> And the thing is, "performance" isn't a secondary feature. It's also not
> something you can add later by optimizing.
>
> It's also a mindset issue. Quite frankly, people who do this by "convert
> to some folded/normalized form, then do the operation" will generally make
> much more fundamental mistakes. Once you get into the mindset of "let's
> pass a corrupted string around", you are in trouble. You start thinking
> that the corrupted string isn't really "corrupt", it's in an "optimized
> format".
>
> And it's all downhill from there. Don't do it.

Again, you seem to have HFS+ behaviour in mind, but that is not what
I suggested. Probably my mistake, sorry.

-- 
marko
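
For readers following the thread, here is a minimal sketch in C of the two approaches being discussed. It is not code from either poster. `do_folding` and `do_hash` are named in the message but never defined, so the ASCII-only fold and the 32-bit FNV-1a hash below are stand-ins chosen for brevity. The unsigned types and the clamp of `len` to the 512-byte buffer are added assumptions. `load8`, `has_non_ascii8` and `fold8_approx` are hypothetical helper names that illustrate the quoted "test high bits in parallel" and "clear bit 5" ideas on one 8-byte word.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for do_folding(): exact ASCII-only lower-to-upper fold. */
static void do_folding(char *dst, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c >= 'a' && c <= 'z')
            c = (unsigned char)(c - ('a' - 'A'));
        dst[i] = (char)c;
    }
}

/* Stand-in for do_hash(): 32-bit FNV-1a, one byte at a time. */
static uint32_t do_hash(const char *buf, size_t len)
{
    uint32_t h = 2166136261u;           /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)buf[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* Two-step variant: the folded copy stays private to this function. */
uint32_t hash_folded(const char *str, size_t len)
{
    char buf[512];
    if (len > sizeof(buf))              /* assumption: hash only a prefix */
        len = sizeof(buf);              /* rather than overflow buf       */
    do_folding(buf, str, len);
    return do_hash(buf, len);
}

/* Load 8 bytes without alignment or aliasing problems. */
static uint64_t load8(const char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof(w));
    return w;
}

/* "Test high bits in parallel": nonzero if any of the 8 bytes is >= 0x80. */
static int has_non_ascii8(uint64_t w)
{
    return (w & 0x8080808080808080ull) != 0;
}

/* "Approximate lower-to-uppercase": clear bit 5 in all 8 bytes at once. */
static uint64_t fold8_approx(uint64_t w)
{
    return w & ~0x2020202020202020ull;
}
```

Clearing bit 5 is only approximate: it maps 'a' to 'A', but it also maps '{' to '['. That is acceptable for a hash, because extra collisions only cost a later exact comparison. It is not acceptable for the comparison itself, which is the distinction the quoted text draws.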