Re: [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Tue, 4 Apr 2023 10:54:27 -0700

On Tue, Apr 4, 2023 at 10:07 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> +       if (c >= 0xc0 && c <= 0xd6)     /* latin A-O with accents */
> +               return true;
> +       if (c >= 0xd8 && c <= 0xde)     /* latin O-Y with accents */
> +               return true;

Please don't do this.

We're not in the dark ages any more. We don't do crazy locale-specific
crud. There is no such thing as "latin1" any more in any valid model.

For example, it is true that 0xC4 is 'Ä' in Latin1, and that the
lower-case version is 'ä', and you can do the lower-casing exactly the
same way as you do for US-ASCII: you just set bit 5 (or "add 32" or
"subtract 0xE0" - the latter is what you seem to do, crazy as it is).
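
A minimal sketch of why those three operations agree for an upper-case byte, assuming 8-bit unsigned arithmetic. The function names are made up for illustration and are not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Three ways to lower-case an upper-case Latin1/ASCII byte. */
uint8_t lower_set_bit5(uint8_t c) { return c | 0x20; }
uint8_t lower_add_32(uint8_t c)   { return (uint8_t)(c + 32); }
/* 0xE0 == 256 - 32, so subtracting it wraps around to "add 32". */
uint8_t lower_sub_e0(uint8_t c)   { return (uint8_t)(c - 0xE0); }

/* All three agree whenever bit 5 is clear, as it is for 'A'-'Z'
 * and for the Latin1 range 0xC0-0xDE. */
void check_equivalence(void)
{
	for (unsigned int c = 'A'; c <= 'Z'; c++) {
		assert(lower_set_bit5(c) == lower_add_32(c));
		assert(lower_add_32(c) == lower_sub_e0(c));
	}
	for (unsigned int c = 0xC0; c <= 0xDE; c++) {
		assert(lower_set_bit5(c) == lower_add_32(c));
		assert(lower_add_32(c) == lower_sub_e0(c));
	}
}
```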

So the above was fine back in the 80s, and questionably correct in the
90s, but it is COMPLETE GARBAGE to do this in the year 2023.

Because 'Ä' today is *not* 0xC4. It is "0xC3 0x84" (in the sanest
simplest form), and your crazy "tolower" will turn that into "0xE3
0x84", and that not only is no longer 'ä', it's not even valid UTF-8
any more.
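
To make the breakage concrete, here is a rough sketch. latin1_tolower() is a hypothetical reconstruction based only on the two ranges quoted above (plus plain A-Z), not the actual patch, and utf8_well_formed() is a deliberately simplified structural check that ignores overlong forms and surrogates:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-byte Latin1 lower-casing, modelled on the quoted ranges. */
uint8_t latin1_tolower(uint8_t c)
{
	if ((c >= 'A' && c <= 'Z') ||
	    (c >= 0xc0 && c <= 0xd6) ||
	    (c >= 0xd8 && c <= 0xde))
		return c | 0x20;
	return c;
}

/* Simplified check: each lead byte must be followed by the right
 * number of 10xxxxxx continuation bytes. */
bool utf8_well_formed(const uint8_t *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		uint8_t c = s[i];
		size_t need;

		if (c < 0x80)
			need = 0;
		else if ((c & 0xe0) == 0xc0)
			need = 1;
		else if ((c & 0xf0) == 0xe0)
			need = 2;
		else if ((c & 0xf8) == 0xf0)
			need = 3;
		else
			return false;	/* stray continuation or invalid lead byte */

		if (len - i - 1 < need)
			return false;	/* sequence is cut short */
		for (size_t k = 1; k <= need; k++)
			if ((s[i + k] & 0xc0) != 0x80)
				return false;
		i += need + 1;
	}
	return true;
}

/* UTF-8 'Ä' is C3 84; the per-byte Latin1 fold turns it into E3 84. */
void check_mangling(void)
{
	uint8_t name[] = { 0xc3, 0x84 };

	assert(utf8_well_formed(name, sizeof(name)));
	for (size_t i = 0; i < sizeof(name); i++)
		name[i] = latin1_tolower(name[i]);
	assert(name[0] == 0xe3 && name[1] == 0x84);
	assert(!utf8_well_formed(name, sizeof(name)));
}
```

0xE3 announces a three-byte sequence, but only one continuation byte follows, so the result is a truncated sequence rather than any character at all.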

I realize that filesystem people really don't get this, but
case-insensitivity is pure and utter CRAP. Really. You *cannot* do
case insensitivity well. It's impossible. It's either locale-dependent,
or you have to have translation models for Unicode characters that are
horrifically slow and even then you *will* get it wrong, because you
will start asking questions about normalization forms, and the end
result is an UNMITIGATED DISASTER.

I wish filesystem people just finally understood this.  FAT was not a
good filesystem.  HFS+ is garbage. And any network filesystem that
does this needs to pass locale information around and do it per-mount,
not on disk.

Because you *will* get it wrong. It's that simple. The only sane model
these days is Unicode, and the only sane encoding for Unicode is
UTF-8, but even given those objectively true facts, you have

 (a) people who are going to use some internal locale, because THEY
HAVE TO. Maybe they have various legacy things, and they use Shift-JIS
or Latin1, and they really treat filenames that way.

 (b) you will have people who disagree about normal forms. NFC is the
only sane case, but you *will* have people who use other forms. OS X
got this completely wrong, and it causes real issues.

 (c) you'll find that "lower-case" isn't even well-defined for various
characters (the typical example is German 'ß', but there are lots of
them)

 (d) and then you'll hit the truly crazy cases with "what about
compatibility equivalence". You'll find that even in English with NBSP
vs regular SPACE, but it gets crazy.
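
Points (b) and (d) can be seen at the byte level. The sketch below assumes UTF-8 filenames and a plain byte comparison; the identifiers are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The same visible 'ä' in two normalization forms. */
const char a_umlaut_nfc[] = "\xc3\xa4";		/* U+00E4, precomposed */
const char a_umlaut_nfd[] = "a\xcc\x88";	/* U+0061 + U+0308 combining diaeresis */

/* Two "spaces" that are only compatibility-equivalent. */
const char plain_space[] = " ";			/* U+0020 */
const char nbsp[] = "\xc2\xa0";			/* U+00A0 NO-BREAK SPACE */

/* Byte-wise comparison, with no Unicode knowledge at all. */
bool same_bytes(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

void check_forms_differ(void)
{
	/* Look identical on screen, differ on disk. */
	assert(!same_bytes(a_umlaut_nfc, a_umlaut_nfd));
	assert(!same_bytes(plain_space, nbsp));
}
```

Making either pair compare equal needs normalization tables, which is exactly the machinery being warned about here.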

End result: the only well-defined area is US-ASCII. Nothing else is
even *remotely* clear. Don't touch it. You *will* screw up.

Now, if you *only* use this for hashing, maybe you will feel like "you
will screw up" is not such a big deal.

But people will wonder why the file 'Björn' does not compare equal to
the file 'BJÖRN' when in a sane locale, but then *does* compare equal
if they happen to use a legacy Latin1 one.

So no. Latin1 isn't that special, and if you special-case them, you
*will* screw up other locales.

The *only* situation where 'tolower()' and 'toupper()' are valid is
for US-ASCII.
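
As a sketch of what "US-ASCII only" means in practice (ascii_tolower() and ascii_fold() are illustrative names, not existing kernel helpers): only the bytes 'A'-'Z' are folded, and every byte >= 0x80 passes through untouched, so UTF-8 sequences survive intact.

```c
#include <assert.h>
#include <string.h>

/* Fold only 'A'-'Z'; leave every other byte, including >= 0x80, alone. */
unsigned char ascii_tolower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (c | 0x20) : c;
}

/* Fold a NUL-terminated name in place. */
void ascii_fold(char *s)
{
	for (; *s; s++)
		*s = (char)ascii_tolower((unsigned char)*s);
}

void check_bjorn(void)
{
	char upper[] = "BJ\xc3\x96RN";	/* UTF-8 'BJÖRN' */

	ascii_fold(upper);
	/* The ASCII letters are folded, the UTF-8 'Ö' (C3 96) is untouched... */
	assert(strcmp(upper, "bj\xc3\x96rn") == 0);
	/* ...so it still does not equal 'björn' (ö is C3 B6). */
	assert(strcmp(upper, "bj\xc3\xb6rn") != 0);
}
```

The result is predictable in every locale: 'BJÖRN' and 'björn' stay distinct names, and nothing is ever turned into invalid UTF-8.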

And when you compare to glibc, you only compare to "some random locale
that happens to be active right now". Something that the kernel
itself cannot and MUST NOT do.

                Linus