On Tue, Apr 4, 2023 at 10:07 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> +	if (c >= 0xc0 && c <= 0xd6)	/* latin A-O with accents */
> +		return true;
> +	if (c >= 0xd8 && c <= 0xde)	/* latin O-Y with accents */
> +		return true;

Please don't do this.

We're not in the dark ages any more. We don't do crazy locale-specific
crud. There is no such thing as "latin1" any more in any valid model.

For example, it is true that 0xC4 is 'Ä' in Latin1, and that the
lower-case version is 'ä', and you can do the lower-casing exactly the
same way as you do for US-ASCII: you just set bit 5 (or "add 32", or
"subtract 0xE0" - the latter is what you seem to do, crazy as it is).

So the above was fine back in the 80s, and questionably correct in the
90s, but it is COMPLETE GARBAGE to do this in the year 2023.

Because 'Ä' today is *not* 0xC4. It is "0xC3 0x84" (in the sanest,
simplest form), and your crazy "tolower" will turn that into "0xE3
0x84", and that not only is no longer 'ä', it's not even valid UTF-8
any more.

I realize that filesystem people really don't get this, but
case-insensitivity is pure and utter CRAP. Really.

You *cannot* do case-insensitivity well. It's impossible. It's either
locale-dependent, or you have to have translation models for Unicode
characters that are horrifically slow, and even then you *will* get it
wrong, because you will start asking questions about normalization
forms, and the end result is an UNMITIGATED DISASTER.

I wish filesystem people just finally understood this. FAT was not a
good filesystem. HFS+ is garbage. And any network filesystem that does
this needs to pass locale information around and do it per-mount, not
on disk.

Because you *will* get it wrong. It's that simple.

The only sane model these days is Unicode, and the only sane encoding
for Unicode is UTF-8, but even given those objectively true facts, you
have

 (a) people who are going to use some internal locale, because THEY
     HAVE TO. Maybe they have various legacy things, and they use
     Shift-JIS or Latin1, and they really treat filenames that way.

 (b) people who disagree about normal forms. NFC is the only sane
     case, but you *will* have people who use other forms. OS X got
     this completely wrong, and it causes real issues.

 (c) characters for which "lower-case" isn't even well-defined (the
     typical example is German 'ß', but there are lots of them).

 (d) and then you'll hit the truly crazy cases of "what about
     compatibility equivalence". You'll find that even in English with
     NBSP vs regular SPACE, and it only gets crazier from there.

End result: the only well-defined area is US-ASCII. Nothing else is
even *remotely* clear. Don't touch it. You *will* screw up.

Now, if you *only* use this for hashing, maybe you will feel like "you
will screw up" is not such a big deal. But people will wonder why the
file 'Björn' does not compare equal to the file 'BJÖRN' when in a sane
locale, but then *does* compare equal if they happen to use a legacy
Latin1 one.

So no. Latin1 isn't that special, and if you special-case it, you
*will* screw up other locales.

The *only* situation where 'tolower()' and 'toupper()' are valid is
US-ASCII.

And when you compare to glibc, you only compare to "some random locale
that happens to be active right now". Something that the kernel itself
cannot and MUST NOT do.

            Linus
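
For concreteness, here is a minimal userspace sketch of the byte-level
breakage described above. The latin1_tolower() helper is a hypothetical
stand-in that mirrors the range checks quoted from the patch; it is not
the actual patch code.

#include <stdio.h>

/* Hypothetical helper mirroring the quoted range checks: lower-case a
 * Latin1 byte by setting bit 5, the way the quoted code effectively
 * does. */
static unsigned char latin1_tolower(unsigned char c)
{
	if ((c >= 0xc0 && c <= 0xd6) || (c >= 0xd8 && c <= 0xde))
		return c | 0x20;	/* same as "add 32" / "subtract 0xE0" */
	return c;
}

int main(void)
{
	/* 'Ä' (U+00C4) encoded as UTF-8: two bytes, 0xC3 0x84. */
	unsigned char name[] = { 0xc3, 0x84 };

	for (size_t i = 0; i < sizeof(name); i++)
		name[i] = latin1_tolower(name[i]);

	/* Prints "0xe3 0x84": the 0xC3 lead byte was "lower-cased" to
	 * 0xE3, so the result is neither 'ä' (0xC3 0xA4) nor valid
	 * UTF-8 -- 0xE3 opens a three-byte sequence that the single
	 * continuation byte 0x84 cannot finish. */
	printf("0x%02x 0x%02x\n", name[0], name[1]);
	return 0;
}

Compile and run it, and the two bytes come out as 0xe3 0x84: no longer
'ä', and no longer well-formed UTF-8.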