Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 20 Nov 2023 18:29:05 -0800

On Mon, 20 Nov 2023 at 18:03, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:
> >     I'm looking at things like
> > generic_ci_d_compare(), and it hurts to see the mindless "let's do
> > lookups and compares one utf8 character at a time". What a disgrace.
> > Somebody either *really* didn't care, or was a Unicode person who
> > didn't understand the point of UTF-8.
>
> This isn't because of case-folding brain damage, but rather Unicode
> brain damage.

No, it really is just stupidity and horribleness.

The thing is, when you check two strings for equality, the FIRST THING
you should do is to just compare them for exactly that: equality.

And no, the way you do that is not by checking each unicode character
one by one.

You do it by just doing a regular memcmp. In fact, you can do even
better than that: while at it, check whether
 (a) all bytes are equal in everything but bit#5
 (b) none of the bytes have the high  bit set
and you have now narrowed down things in a big way. You can do these
things trivially one whole word at a time, and you'll handle 99% of
all input without EVER doing any Unicode garbage AT ALL.

Yes, yes, if you actually have complex characters, you end up having
to deal with that mess. But no, that is *not* an excuse for saying
"all characters are complex".

So no. There is absolutely zero excuse for doing stupid things, except
for "nobody has ever cared, because case folding is so stupid to begin
with that people just expect it to perform horribly badly".

End result:

 - generic_ci_d_compare() should *not* consider the memcmp() to be a
"fall back to this for non-casefolded". You should start with that,
and if the bytes are equal then the strings are equal. End of story.

 - if the bytes are not equal, then the strings *might* still compare
equal if it's a casefolded directory.

 - but EVEN THEN you shouldn't fall back to actually doing UTF-8
decoding unless you saw the high bit being set at some point.

 - and if they different in anything but bit #5 and you didn't see the
high bit, you know they are different.

It's a bit complicated, yes. But no, doing things one unicode
character at a time is just bad bad bad.

               Linus