On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
>
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
>
> What I'm confused is why encoded as utf-8 after normalize finished?
> From above, turn "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) into the same
> Unicode string. Then why should we just compare bytes from normalized.

For the same reason that we don't upcase or downcase all of the letters
in a directory with case-folding.  The term for this is
"case-preserving, case-insensitive" matching.

So that means that if you save a file as "Makefile", ls will return
"Makefile", and not "MAKEFILE" or "makefile".  Of course, if you delete
or truncate "makefile", it will affect the file stored in the directory
as "Makefile", and the file system will not allow a directory with
case-folding enabled to contain "makefile" and "Makefile" at the same
time.

Similarly, with normalization, we preserve the existing UTF-8 form
(both the composed and decomposed forms are valid UTF-8), but we
compare without taking the composition form into account.

Cheers,

						- Ted

P.S.  Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do as
well.  (Well, mostly; MacOS is a little weird for historical reasons.)

P.P.S.  And before you comment on it, as one internationalization
expert once said, I18N *is* complicated.  It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, a la Star Trek.  For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be.  :-)
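
P.P.P.S.  If it helps to see the comparison model spelled out, here is
a rough userspace sketch in Python of the "normalize and case-fold only
for comparison, preserve the stored name" idea.  This is just an
illustration, not the kernel's fs/unicode code; the helper name and the
choice of NFD below are made up for this example.

    import unicodedata

    def comparison_key(name, casefold=True):
        # Reduce a filename to the form used *only* for comparison.
        # The stored name is never rewritten; this key exists so that
        # two names can be compared byte-by-byte.  (NFD is just one
        # choice of normalization form; the point is that "ñ" (U+00F1)
        # and "n" + U+0303 end up as the same sequence.)
        key = unicodedata.normalize("NFD", name)
        if casefold:
            key = key.casefold()        # case-insensitive lookup
        return key.encode("utf-8")      # compare as UTF-8 bytes

    # The directory keeps whatever form was used at creation time...
    stored_name = "Makefile"

    # ...but lookups match regardless of case:
    for candidate in ("makefile", "MAKEFILE", "Makefile"):
        assert comparison_key(candidate) == comparison_key(stored_name)

    # ...and regardless of composition form:
    composed = "\u00f1"        # "ñ" as a single code point
    decomposed = "n\u0303"     # "n" followed by a combining tilde
    assert comparison_key(composed) == comparison_key(decomposed)

    print("all lookups match; stored name is still", repr(stored_name))

The kernel does the equivalent on UTF-8 byte strings with its own
normalization and case-folding tables, but the shape of the comparison
is the same: normalize (and optionally case-fold) each name, then
compare the resulting bytes.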