On Mon, 20 Nov 2023 at 18:03, Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:
> > I'm looking at things like
> > generic_ci_d_compare(), and it hurts to see the mindless "let's do
> > lookups and compares one utf8 character at a time". What a disgrace.
> > Somebody either *really* didn't care, or was a Unicode person who
> > didn't understand the point of UTF-8.
>
> This isn't because of case-folding brain damage, but rather Unicode
> brain damage.

No, it really is just stupidity and horribleness.

The thing is, when you check two strings for equality, the FIRST THING
you should do is to just compare them for exactly that: equality.

And no, the way you do that is not by checking each unicode character
one by one.

You do it by just doing a regular memcmp. In fact, you can do even
better than that: while at it, check whether

 (a) all bytes are equal in everything but bit#5

 (b) none of the bytes have the high bit set

and you have now narrowed down things in a big way. You can do these
things trivially one whole word at a time, and you'll handle 99% of
all input without EVER doing any Unicode garbage AT ALL.

Yes, yes, if you actually have complex characters, you end up having
to deal with that mess. But no, that is *not* an excuse for saying
"all characters are complex".

So no. There is absolutely zero excuse for doing stupid things, except
for "nobody has ever cared, because case folding is so stupid to begin
with that people just expect it to perform horribly badly".

End result:

 - generic_ci_d_compare() should *not* consider the memcmp() to be a
   "fall back to this for non-casefolded". You should start with that,
   and if the bytes are equal then the strings are equal. End of story.
 - if the bytes are not equal, then the strings *might* still compare
   equal if it's a casefolded directory.
 - but EVEN THEN you shouldn't fall back to actually doing UTF-8
   decoding unless you saw the high bit being set at some point.
 - and if they differ in anything but bit #5 and you didn't see the
   high bit, you know they are different.

It's a bit complicated, yes. But no, doing things one unicode
character at a time is just bad bad bad.

               Linus
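
For illustration only, the fast path described in the message above could be sketched in userspace C roughly as follows. This is not the kernel's generic_ci_d_compare(). The names ci_fast_compare() and enum ci_fast_result are hypothetical, invented for this sketch. It assumes that case folding of pure ASCII only maps A-Z to a-z, so two pure-ASCII names of different length can never match, and that anything with a high bit set is handed to a real UTF-8 casefold comparison, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for this sketch; these are not kernel names. */
enum ci_fast_result { CI_EQUAL, CI_DIFFERENT, CI_NEED_UTF8 };

#define HIGH_BITS 0x8080808080808080ULL  /* bit 7 of every byte */
#define NOT_BIT5  0xdfdfdfdfdfdfdfdfULL  /* every bit except bit 5 */

static enum ci_fast_result ci_fast_compare(const unsigned char *a, size_t alen,
                                           const unsigned char *b, size_t blen)
{
    uint64_t high = 0, diff = 0;
    size_t i;

    /* Step 1: byte-identical names are equal, casefolded or not. */
    if (alen == blen && memcmp(a, b, alen) == 0)
        return CI_EQUAL;

    if (alen != blen) {
        /* Assumption: pure-ASCII folding preserves length, so only a
         * high bit somewhere forces the slow UTF-8 path. */
        for (i = 0; i < alen; i++)
            high |= a[i];
        for (i = 0; i < blen; i++)
            high |= b[i];
        return (high & 0x80) ? CI_NEED_UTF8 : CI_DIFFERENT;
    }

    /* Step 2: one word at a time, accumulate all bytes seen (for the
     * high-bit test) and all differing bits. */
    for (i = 0; i + 8 <= alen; i += 8) {
        uint64_t wa, wb;

        memcpy(&wa, a + i, 8);  /* memcpy avoids unaligned loads */
        memcpy(&wb, b + i, 8);
        high |= wa | wb;
        diff |= wa ^ wb;
    }
    for (; i < alen; i++) {     /* leftover tail bytes */
        high |= (uint64_t)(a[i] | b[i]);
        diff |= (uint64_t)(a[i] ^ b[i]);
    }

    /* Step 3: any high bit means real Unicode; let the slow path decide. */
    if (high & HIGH_BITS)
        return CI_NEED_UTF8;

    /* Step 4: pure ASCII differing outside bit 5 cannot be a case match. */
    if (diff & NOT_BIT5)
        return CI_DIFFERENT;

    /* Step 5: only bit 5 differs, but that also pairs '@' with '`' and
     * '[' with '{', so check that each differing byte is a letter. */
    for (i = 0; i < alen; i++) {
        if (a[i] != b[i]) {
            unsigned char lower = a[i] | 0x20;

            if (lower < 'a' || lower > 'z')
                return CI_DIFFERENT;
        }
    }
    return CI_EQUAL;
}
```

Only CI_NEED_UTF8 would require decoding; a caller in a non-casefolded directory would stop after the memcmp() and never reach the rest. Step 5 is a per-byte loop to keep the sketch short, and it only runs for pure-ASCII names that already passed the word-wise filter.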