Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 20 Nov 2023 19:03:13 -0800

On Mon, 20 Nov 2023 at 18:29, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> It's a bit complicated, yes. But no, doing things one unicode
> character at a time is just bad bad bad.

Put another way: the _point_ of UTF-8 is that ASCII is still ASCII.
It's literally why UTF-8 doesn't suck.

So you can still compare ASCII strings as-is.

No, that doesn't help people who are really using other locales, and
are actively using complicated characters.

But it very much does mean that you can compare "Bad" and "bad" and
never ever look at any unicode translation ever.

In a perfect world, you'd use all the complicated DCACHE_WORD_ACCESS
stuff that can do all of this one word at a time.

But even if you end up doing the rules just one byte at a time, it
means that you can deal with the common cases without "unicode
cursors" or function calls to extract unicode characters, or anything
like that. You can still treat things as bytes.

So the top of generic_ci_d_compare() should probably be something
trivial like this:

        const char *ct = name.name;
        unsigned int tcount = name.len;

        /* Handle the exact equality quickly */
        if (len == tcount && !dentry_string_cmp(str, ct, tcount))
                return 0;

because byte-wise equality is equality even if high bits are set.

After that, it should probably do something like

        /* Not byte-identical, but maybe igncase identical in ASCII */
        do {
                unsigned char a, b;

                /* Dentry name byte */
                a = *str;

                /* If that's NUL, the qstr needs to be done too! */
                if (!a)
                        return !!tcount;

                /* Alternatively, if the qstr is done, it needed to be NUL */
                if (!tcount)
                        return 1;
                b = *ct;

                if ((a | b) & 0x80)
                        break;

                if (a != b) {
                        /* Quick "not same" igncase ASCII */
                        if ((a ^ b) & ~32)
                                return 1;
                        a &= ~32;
                        if (a < 'A' || a > 'Z')
                                return 1;
                }

                /* Ok, same ASCII, bytefolded, go to next */
                str++;
                ct++;
                tcount--;
                len--;
        }

and only after THAT should it do the utf name comparison (and only on
the remaining parts, since the above will have checked for common
ASCII beginnings).

And the above was obviously never tested, and written in the MUA, and
may be completely wrong in all the details, but you get the idea. Deal
with the usual cases first. Do the full unicode only when you
absolutely have to.

                Linus