Linus Torvalds:

> But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that:
> different strings (not even characters: the second is actually a
> multi-character) that just look the same.

But they are not different strings; they are canonically equivalent
as far as Unicode is concerned. They're even supposed to map to the
same glyph (if the font has an "ä", it should display it in both
cases; if it only has an "a" and a combining diaeresis, it should
make one up). You cannot do a binary comparison of the text to see
whether two strings are equivalent.

> You try to twist the argument by just claiming that they are the same
> "character". They aren't, unless you *define* character to be the
> same as "glyph".

Whereas you are confusing characters and code points. "ä" and "a¨"
use different code points, but they encode the same character, and
from the user's perspective it is the *character* that is interesting
(although he might confuse it with the glyph).

> I don't know how NTFS works (I know it is Unicode-aware, and I think
> it encodes filenames in UCS-2 or possibly UTF-16,

Actually, NTFS is a bit broken. It sees file names as strings of
16-bit words. It doesn't check that they are valid UTF-16, or even
valid UCS-2; it allows almost anything.

Apple made Mac OS X handle filenames properly by treating file names
as strings of characters, not code points, so they use a canonical
form for all characters (personally, I would have preferred the
pre-composed form, though).

-- 
\\// Peter - http://www.softwolves.pp.se/
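
To make the comparison point concrete, here is a minimal sketch using
Python's standard unicodedata module (any language with a Unicode
normalization library would show the same thing):

    import unicodedata

    precomposed = "\u00e4"    # 'ä' as a single code point (U+00E4)
    decomposed  = "a\u0308"   # 'a' followed by COMBINING DIAERESIS (U+0308)

    # A plain binary/code-point comparison says they differ...
    print(precomposed == decomposed)        # False

    # ...but after normalizing to a common canonical form they compare
    # equal: NFC maps both to the pre-composed form, NFD to the
    # decomposed one.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True

Any file system (or tool) that wants to treat these two spellings as
the same name has to normalize to one canonical form before comparing,
which is what Mac OS X does (using the decomposed form).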