Re: [PATCH 1/3] xfs: stabilize the tolower function used for ascii-ci dir hash computation

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 4 Apr 2023 11:32:14 -0700

On Tue, Apr 04, 2023 at 10:54:27AM -0700, Linus Torvalds wrote:
> On Tue, Apr 4, 2023 at 10:07 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> >
> > +       if (c >= 0xc0 && c <= 0xd6)     /* latin A-O with accents */
> > +               return true;
> > +       if (c >= 0xd8 && c <= 0xde)     /* latin O-Y with accents */
> > +               return true;
> 
> Please don't do this.
> 
> We're not in the dark ages any more. We don't do crazy locale-specific
> crud. There is no such thing as "latin1" any more in any valid model.
> 
> For example, it is true that 0xC4 is 'Ä' in Latin1, and that the
> lower-case version is 'ä', and you can do the lower-casing exactly the
> same way as you do for US-ASCII: you just set bit 5 (or "add 32" or
> "subtract 0xE0" - the latter is what you seem to do, crazy as it is).
> 
> So the above was fine back in the 80s, and questionably correct in the
> 90s, but it is COMPLETE GARBAGE to do this in the year 2023.

Yeah, I get that.  Fifteen years ago, Barry Naujok and Christoph merged
this weird ascii-ci feature for XFS that purportedly does ... something.
It clearly only works properly if you force userspace to use latin1,
which is totally nuts in 2023 given that the distros default to UTF8
and likely don't test anything else.  It probably wasn't even a good
idea in *2008*, but it went in anyway.  Nobody tested this feature,
metadump breaks with this feature enabled, but as maintainer I get to
maintain these poorly designed half baked projects.

I wouldn't ever enable this feature on any computer I use, and I think
the unicode case-insensitive stuff that's been put in to ext4 and f2fs
lately are not a tarpit that I ever want to visit in XFS.  Directory
names should be sequences of bytes that don't include nulls or slashes,
end of story.

Frankly, I'd rather just drop the entire ascii-ci feature from XFS.  I
doubt anyone's really using it given that xfs_repair will corrupt the
filesystem.  This patch encodes the isupper/tolower code from the
kernel's lib/ctype.c into libxfs so that repair will stop corrupting
these filesystems, nothing more.

We're not introducing any new functionality, just making stupid code
less broken.  I wasn't around let alone maintainer when this feature was
committed, and I wouldn't let it in now.  I am, however, the stuckee who
has to clean up all this shit in the least impactful way I can devise.

Can we instead simply drop ascii-ci support from Linux 6.3i and abruptly
break all those filesystems?  Even though we're past -rc5 now?  That
would make all of this baloney go away, at a cost of breaking userspace.

> Because 'Ä' today is *not* 0xC4. It is "0xC3 0x84" (in the sanest
> simplest form), and your crazy "tolower" will turn that into "0xE3
> 0x84", and that not only is no longer 'ä', it's not even valid UTF-8
> any  more.
> 
> I realize that filesystem people really don't get this, but
> case-insensitivity is pure and utter CRAP. Really. You *cannot* do
> case sensitivity well. It's impossible. It's either locale-dependent,
> or you have to have translation models for Unicode characters that are
> horrifically slow and even then you *will* get it wrong, because you
> will start asking questions about normalization forms, and the end
> result is an UNMITIGATED DISASTER.

Agreed!

> I wish filesystem people just finally understood this.  FAT was not a
> good filesystem.  HFS+ is garbage. And any network filesystem that
> does this needs to pass locale information around and do it per-mount,
> not on disk.
> 
> Because you *will* get it wrong. It's that simple. The only sane model
> these days is Unicode, and the only sane encoding for Unicode is
> UTF-8, but even given those objectively true facts, you have
> 
>  (a) people who are going to use some internal locale, because THEY
> HAVE TO. Maybe they have various legacy things, and they use Shift-JIS
> or Latin1, and they really treat filenames that way.
> 
>  (b) you will have people who disagree about normal forms. NFC is the
> only sane case, but you *will* have people who use other forms. OS X
> got this completely wrong, and it causes real issues.
> 
>  (c) you'll find that "lower-case" isn't even well-defined for various
> characters (the typical example is German 'ß', but there are lots of
> them)
> 
>  (d) and then you'll hit the truly crazy cases with "what about
> compatibility equivalence". You'll find that even in English with NBSP
> vs regular SPACE, but it gets crazy.
> 
> End result: the only well-defined area is US-ASCII. Nothing else is
> even *remotely* clear. Don't touch it. You *will* screw up.
> 
> Now, if you *only* use this for hashing, maybe you will feel like "you
> will screw up" is not such a big deal.
> 
> But people will wonder why the file 'Björn' does not compare equal to
> the file 'BJÖRN' when in a sane locale, but then *does* compare equal
> if they happen to use a legacy Latin1 one.
> 
> So no. Latin1 isn't that special, and if you special-case them, you
> *will* screw up other locales.
> 
> The *only* situation where 'tolower()' and 'toupper()' are valid is
> for US-ASCII.
> 
> And when you compare to glibc, you only compare to "some random locale
> that happens to be active rigth n ow". Something that the kernel
> itself cannot and MUST NOT do.

What then is the point of having tolower in the kernel at all?

--D

>                 Linus