On Mon, 20 Nov 2023 at 18:29, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> It's a bit complicated, yes. But no, doing things one unicode
> character at a time is just bad bad bad.

Put another way: the _point_ of UTF-8 is that ASCII is still ASCII.
It's literally why UTF-8 doesn't suck.

So you can still compare ASCII strings as-is.

No, that doesn't help people who are really using other locales, and
are actively using complicated characters.

But it very much does mean that you can compare "Bad" and "bad" and
never ever look at any unicode translation ever.

In a perfect world, you'd use all the complicated DCACHE_WORD_ACCESS
stuff that can do all of this one word at a time.

But even if you end up doing the rules just one byte at a time, it
means that you can deal with the common cases without "unicode
cursors" or function calls to extract unicode characters, or anything
like that. You can still treat things as bytes.

So the top of generic_ci_d_compare() should probably be something
trivial like this:

	const char *ct = name.name;
	unsigned int tcount = name.len;

	/* Handle the exact equality quickly */
	if (len == tcount && !dentry_string_cmp(str, ct, tcount))
		return 0;

because byte-wise equality is equality even if high bits are set.

After that, it should probably do something like

	/* Not byte-identical, but maybe igncase identical in ASCII */
	do {
		unsigned char a, b;

		/* Dentry name byte */
		a = *str;

		/* If that's NUL, the qstr needs to be done too! */
		if (!a)
			return !!tcount;

		/* Alternatively, if the qstr is done, it needed to be NUL */
		if (!tcount)
			return 1;
		b = *ct;

		if ((a | b) & 0x80)
			break;

		if (a != b) {
			/* Quick "not same" igncase ASCII */
			if ((a ^ b) & ~32)
				return 1;
			a &= ~32;
			if (a < 'A' || a > 'Z')
				return 1;
		}

		/* Ok, same ASCII, bytefolded, go to next */
		str++;
		ct++;
		tcount--;
		len--;
	}

and only after THAT should it do the utf name comparison (and only on
the remaining parts, since the above will have checked for common
ASCII beginnings).
And the above was obviously never tested, and written in the MUA, and
may be completely wrong in all the details, but you get the idea. Deal
with the usual cases first. Do the full unicode only when you
absolutely have to.

              Linus
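For illustration, the ASCII fast path described in the message can be
sketched as a standalone user-space C function. This is a minimal
sketch under stated assumptions, not the kernel's implementation: the
names ascii_ci_prefix_cmp(), enum ci_result and ascii_ci_selftest()
are hypothetical, memcmp() stands in for dentry_string_cmp(), and both
names are passed with explicit lengths instead of relying on a
NUL-terminated dentry name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not a kernel API. */
enum ci_result { CI_EQUAL, CI_DIFFERENT, CI_NEED_UNICODE };

/*
 * Compare two length-delimited names, ignoring ASCII case only.
 * *consumed receives the number of leading bytes already known to
 * match, so a caller could hand just the remainder to a full
 * Unicode comparison when CI_NEED_UNICODE is returned.
 */
static enum ci_result ascii_ci_prefix_cmp(const char *s, size_t slen,
					  const char *t, size_t tlen,
					  size_t *consumed)
{
	size_t i;

	/* Byte-wise equality is equality, even with high bits set. */
	if (slen == tlen && memcmp(s, t, slen) == 0) {
		*consumed = slen;
		return CI_EQUAL;
	}

	for (i = 0; i < slen && i < tlen; i++) {
		unsigned char a = (unsigned char)s[i];
		unsigned char b = (unsigned char)t[i];

		*consumed = i;

		/* Non-ASCII byte on either side: defer to Unicode. */
		if ((a | b) & 0x80)
			return CI_NEED_UNICODE;

		if (a != b) {
			/* Differ in more than bit 5: cannot be a case pair. */
			if ((a ^ b) & ~32)
				return CI_DIFFERENT;
			/* Differ only in bit 5: must be a letter to match. */
			a &= ~32;
			if (a < 'A' || a > 'Z')
				return CI_DIFFERENT;
		}
	}

	/* One or both names ran out while still in ASCII. */
	*consumed = i;
	return slen == tlen ? CI_EQUAL : CI_DIFFERENT;
}

/* Small usage example exercising the three outcomes. */
static void ascii_ci_selftest(void)
{
	size_t n;

	assert(ascii_ci_prefix_cmp("Bad", 3, "bad", 3, &n) == CI_EQUAL);
	assert(ascii_ci_prefix_cmp("Bad", 3, "bat", 3, &n) == CI_DIFFERENT);
	/* '@' and '`' differ only in bit 5 but are not letters. */
	assert(ascii_ci_prefix_cmp("a@", 2, "a`", 2, &n) == CI_DIFFERENT);
	/* UTF-8 tail after a matching ASCII prefix needs Unicode. */
	assert(ascii_ci_prefix_cmp("abc\xc3\xa9", 5, "ABC\xc3\x89", 5, &n)
	       == CI_NEED_UNICODE && n == 3);
}
```

In this sketch a caller would run the full UTF-8 casefolding
comparison only when the result is CI_NEED_UNICODE, and then only on
the bytes from offset *consumed onward, which is the ordering the
message argues for.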