Re: git on MacOSX and files with decomposed utf-8 file names

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
> > 
> > I happen to prefer the text-as-string-of-characters (or code points,
> > since you use the other meaning of characters in your posts), since I
> > come from the text world, having worked a lot on Unicode text
> > processing.
> > 
> > You apparently prefer the text-as-sequence-of-octets, which I tend to
> > dislike because I would have thought computer engineers would have
> > evolved beyond this when we left the 1900s.
> 
> I agree. Every single problem that I can recall Linus bringing up as a
> consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

The fact is, text-as-string-of-codepoints (let's make the "codepoints" 
explicit, so that there is no ambiguity, and let's also make it clear 
that a codepoint *is* how a Unicode character is defined, and a Unicode 
"string" is actually *defined* to be a sequence of codepoints, totally 
independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those 
codepoints, since we create hashes etc., so in fact, as far as git is 
concerned, names have to be in some particular encoding to be hashed, and 
UTF-8 is the only sane encoding for Unicode. People can blather about 
UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is 
simply technically superior in so many ways that I don't even understand 
why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all.
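
To make the encoding point concrete, here is a quick Python sketch 
(purely illustrative: this is not how git actually hashes tree entries, 
and the filename is a made-up example):

    # Hashing operates on bytes, so the same codepoint sequence gives
    # different bytes (and different SHA-1s) in different encodings.
    import hashlib

    name = "ä.txt"   # codepoints U+00E4 U+002E U+0074 U+0078 U+0074

    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        data = name.encode(enc)
        print(enc, data.hex(), hashlib.sha1(data).hexdigest())

    # Three different byte strings, three different hashes: git has to
    # pick one concrete encoding for the names it hashes, and UTF-8 is
    # the sane pick.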

But that is *entirely* a separate issue from "normalization". 

Kevin, you seem to think that normalization is somehow forced on you by 
the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. 
Normalization is a totally separate decision, and it's a STUPID one, 
because it breaks so many of the _nice_ properties of using UTF-8.

And THAT is where we differ. It has nothing to do with "octets". It has 
nothing to do with not liking Unicode. It has nothing to do with 
"strings". 

In short:

 - normalization is by no means required or even a good feature. It's 
   something you do when you want to know if two strings are equivalent, 
   but that doesn't actually mean that you should keep the strings 
   normalized all the time! (See the sketch right after this list.)

 - normalization has *nothing* to do with "treating text as octets". 
   That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point, 
   since you need that to even compute a SHA1 in the first place, but that 
   has *nothing* to do with normalization or the lack of it.
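
As a sketch of that first point (normalize to *compare*, never to 
*store*; the filenames below are hypothetical examples):

    # Two canonically equivalent names: same text, different codepoints.
    import unicodedata

    def equivalent(a: str, b: str) -> bool:
        # Normalizing both sides is fine *inside* the comparison.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    stored = "\u00e4.txt"       # precomposed 'ä', kept exactly as given
    lookup = "a\u0308.txt"      # decomposed 'a' + combining diaeresis

    print(equivalent(stored, lookup))   # True: good enough for lookup
    print(stored == lookup)             # False: the actual data differs,
                                        # and that difference is preserved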

Got it? Forced normalization is stupid, because it changes the data and 
removes information, and unless you know that change is safe, it's the 
wrong thing to do.

One reason _not_ to do normalization is that if you don't, you can still 
interact with no ambiguity with other non-Unicode locales. You can do the 
1:1 Latin1<->Unicode translation, and you *never* get into trouble. In 
contrast, if you normalize, it's no longer a 1:1 translation any more, and 
you can get into a situation where the translation from Latin1 to Unicode 
and back results in a *different* filename than the one you started with!
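
Here is that round trip as a Python sketch (using Python's "latin-1" 
codec for Latin1, and NFD because that is what HFS+ applies; the 
filename is a made-up example):

    import unicodedata

    latin1_name = b"\xe4.txt"                   # 'ä.txt' in Latin1

    as_unicode = latin1_name.decode("latin-1")  # 1:1 into Unicode
    assert as_unicode.encode("latin-1") == latin1_name   # lossless, always

    # Add forced normalization, and the round trip breaks:
    normalized = unicodedata.normalize("NFD", as_unicode)  # 'a' + U+0308
    try:
        normalized.encode("latin-1")
    except UnicodeEncodeError:
        print("U+0308 has no Latin1 form: the name no longer round-trips")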

See? That's a *serious* problem. A system that forces normalization BY 
DEFINITION cannot work with people who use a Latin1 filesystem, because it 
will corrupt the filenames!

But you are apparently too damn stupid to understand that "data 
corruption" == "bad", and too damn stupid to see that "Unicode" does not 
mean "Forced normalization".

But I'll try one more time. Let's say that I work on a project where there 
are some people who use Latin1, and some people who use UTF-8, and we use 
special characters. It should all work, as long as we use only the common 
subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting 
filenames is ok, that doesn't work! It works wonderfully well if you do 
the obvious 1:1 translation and you do *not* normalize, but the moment you 
start normalizing, you actually corrupt the filenames!

And yes, the character sequence 'a¨' is exactly one such sequence. It's 
perfectly representable in both Latin1 and in UTF-8: in Latin1 it is a 
two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it 
becomes '\x61\xc2\xa8', and you can convert back and forth between those 
two forms an infinite number of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up. 
Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming a composing 
normalization, NFC-style) '\xc3\xa4' in UTF-8, and when converted back to 
Latin1, it is now '\xe4', i.e. that filename has been corrupted!

See? Normalization in the face of working together with others is a total 
and utter mistake, and yes, it really *does* corrupt data. It makes it 
fundamentally impossible to reliably work together with other encodings - 
even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much 
  be used as a "generic encoding" for just about all locales - if you know 
  the locale you convert from and to, you can generally use UTF-8 as an 
  internal format, knowing that you can always get the same result back in 
  the original encoding. Normalization literally breaks that wonderful 
  generic capability of Unicode.

  And the fact that Unicode is such a "generic replacement" for any locale 
  is exactly what makes it so wonderful, and allows you to fairly 
  seamlessly convert piece-meal from some particular locale to Unicode: 
  even if you have some programs that still work in the original locale, 
  you know that you can convert back to it without loss of information.

  Except if you normalize. In that case, you *do* lose information, and 
  suddenly one of the best things about Unicode simply disappears.

  As a result, people who force-normalize are idiots. But they seem to 
  also be stupid enough that they don't understand that they are idiots.
  Sad. 

  It's a bit like whitespace. Whitespace "doesn't matter" in text (== is 
  equivalent), but an email client that force-normalizes whitespace in 
  text is a really *broken* email client, because it turns out that 
  sometimes even the "equivalent" forms simply do matter. Patches are 
  text, but whitespace is meaningful there. 

  Same exact deal: it's good to have the *ability* to normalize 
  whitespace (in email, we call this "format=flowed" or similar), and in 
  some cases you might even want to make it the default action, but 
  *forcing* normalization is total idiocy and actually makes the system 
  less useful! ]

		Linus
