On Mon, Nov 20, 2023 at 10:07:51AM -0800, Linus Torvalds wrote:
> Of course, "do it in shared generic code" doesn't tend to really fix
> the braindamage, but at least it's now shared braindamage and not
> spread out all over. I'm looking at things like
> generic_ci_d_compare(), and it hurts to see the mindless "let's do
> lookups and compares one utf8 character at a time". What a disgrace.
> Somebody either *really* didn't care, or was a Unicode person who
> didn't understand the point of UTF-8.

This isn't because of case-folding brain damage, but rather Unicode
brain damage.  We compare one character at a time because it's
possible for a character like é to be encoded either as 0x00E9
(aka "Latin Small Letter E with Acute") OR as 0x0065 0x0301 ("Latin
Small Letter E" plus "Combining Acute Accent").

Typically, we pretend that UTF-8 means that we can just encode é, or
0x00E9, as 0xC3 0xA9 and then call it a day and just use strcmp(3) on
the sucker.  But Unicode is a lot more insane than that.  Technically,
0x65 0xCC 0x81 is the same character as 0xC3 0xA9.

> Oh well. I guess people went "this is going to suck anyway, so let's
> make sure it *really* sucks".

It's more like, "this is going to suck, but if it's going to suck
anyway, let's implement the full Unicode spec in all its gory^H^H^H^H
glory, whether or not it's sane".

						- Ted
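
To illustrate the equivalence described above, here is a minimal userspace sketch in Python. It is not the kernel's code and makes no claim about how generic_ci_d_compare() is implemented. It only shows that the two UTF-8 byte sequences for é differ under a plain byte comparison (what strcmp(3) would see) but match once both are put into the same Unicode normalization form. The helper name canonically_equal() is hypothetical, and the choice of NFD is an assumption made for the example.

```python
import unicodedata

# Two spellings of the same character.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

pre_bytes = precomposed.encode("utf-8")   # 0xC3 0xA9
dec_bytes = decomposed.encode("utf-8")    # 0x65 0xCC 0x81


def canonically_equal(a: bytes, b: bytes) -> bool:
    """Hypothetical helper: decode both names as UTF-8, normalize them
    to NFD (an assumption for this sketch), and compare the results."""
    na = unicodedata.normalize("NFD", a.decode("utf-8"))
    nb = unicodedata.normalize("NFD", b.decode("utf-8"))
    return na == nb


# A byte-wise comparison treats the two names as different...
print(pre_bytes == dec_bytes)                    # False
# ...while a normalization-aware comparison treats them as the same.
print(canonically_equal(pre_bytes, dec_bytes))   # True
```

Because the decomposed form is one byte longer and the bytes differ from the second position on, no byte-level shortcut can detect the match. The comparison has to work on decoded characters, which is the cost being discussed in this thread.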