On Tue, Apr 4, 2023 at 11:19 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Limiting yourself to US-ASCII is at least technically valid. Because
> EBCDIC isn't worth worrying about. But when the high bit is set, you
> had better not touch it, or you need to limit it spectacularly.

Side note: if limiting it to US-ASCII is fine (and it had better be,
because as mentioned, anything else will result in unresolvable
problems), you might look at using this as the pre-hash function:

    unsigned char prehash(unsigned char c)
    {
        unsigned char mask = (~(c >> 1) & c & 64) >> 1;
        return c & ~mask;
    }

which does modify a few other characters too, but nothing that matters
for hashing.

The advantage of the above is that you can trivially vectorize it. You
can do it with just regular integer math (64 bits = 8 bytes in
parallel), no need to use *actual* vector hardware.

The actual comparison needs to do the careful thing (because '~' and
'^' may hash to the same value, but obviously aren't the same), but
even there you can do a cheap "are these 8 characters _possibly_ the
same?" check with a very simple single 64-bit comparison, and only go
to the careful path if things match, ie

    /* Cannot possibly be equal even case-insensitively? */
    if ((word1 ^ word2) & ~0x2020202020202020ul)
        continue;

    /* Ok, same in all but bit 5, go be careful */
    ....

and the reason I mention this is that I have been idly thinking about
supporting case-insensitivity at the VFS layer for multiple decades,
but have always decided that it's *so* nasty that I was really hoping
it just never becomes an issue in practice.

Particularly since the low-level filesystems then inevitably decide
that they need to do things wrong and need a locale, and at that point
all hope is lost.

I was hoping xfs would be one of the sane filesystems.

              Linus