On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
>
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
>
> What I'm confused is why encoded as utf-8 after normalize finished?
> From above, turn "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) into the same
> Unicode string. Then why should we just compare bytes from normalized.

For the same reason that we don't upcase or downcase all of the letters
in a directory with case-folding.  The term for this is
"case-preserving, case-insensitive" matching.

So that means that if you save a file as "Makefile", ls will return
"Makefile", and not "MAKEFILE" or "makefile".  Of course, if you delete
or truncate "makefile", it will affect the file stored in the directory
as "Makefile", and the file system will not allow a directory with
case-folding enabled to contain "makefile" and "Makefile" at the same
time.

Similarly, with normalization, we preserve the existing UTF-8 form
(both the composed and decomposed forms are valid UTF-8), but we
compare without taking the composition form into account.

Cheers,

						- Ted

P.S.  Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do as
well.  (Well, mostly; MacOS is a little weird for historical reasons.)

P.P.S.  And before you comment on it, as one internationalization
expert once said, I18N *is* complicated.  It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, a la Star Trek.  For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be.  :-)
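
P.P.P.S.  If it helps to see the comparison model spelled out, here is
a rough userspace sketch in Python of the "normalize and case-fold only
for comparison, preserve the stored name" idea.  This is just an
illustration, not the kernel's fs/unicode code; the helper name and the
choice of NFD below are made up for this example.

    import unicodedata

    def comparison_key(name, casefold=True):
        # Reduce a filename to the form used *only* for comparison.
        # The stored name is never rewritten; this key exists so that
        # two names can be compared byte-by-byte.  (NFD is just one
        # choice of normalization form; the point is that "ñ" (U+00F1)
        # and "n" + U+0303 end up as the same sequence.)
        key = unicodedata.normalize("NFD", name)
        if casefold:
            key = key.casefold()        # case-insensitive lookup
        return key.encode("utf-8")      # compare as UTF-8 bytes

    # The directory keeps whatever form was used at creation time...
    stored_name = "Makefile"

    # ...but lookups match regardless of case:
    for candidate in ("makefile", "MAKEFILE", "Makefile"):
        assert comparison_key(candidate) == comparison_key(stored_name)

    # ...and regardless of composition form:
    composed = "\u00f1"        # "ñ" as a single code point
    decomposed = "n\u0303"     # "n" followed by a combining tilde
    assert comparison_key(composed) == comparison_key(decomposed)

    print("all lookups match; stored name is still", repr(stored_name))

The kernel does the equivalent on UTF-8 byte strings with its own
normalization and case-folding tables, but the shape of the comparison
is the same: normalize (and optionally case-fold) each name, then
compare the resulting bytes.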