Re: git on MacOSX and files with decomposed utf-8 file names

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 21 Jan 2008 11:41:08 -0800 (PST)

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I'm not saying it's forced on you, I'm saying when you treat filenames as
> text, it DOESN'T MATTER if the string gets normalized. As long as the string
> remains equivalent, YOU DON'T CARE about the underlying byte stream.

Sure I do, because it matters a lot for things like - wait for it - 
checksumming it.

> Alright, fine. I'm not saying HFS+ is right in storing the normalized version,
> but I do believe the authors of HFS+ must have had a reason to do that, and I
> also believe that it shouldn't make any difference to me since it remains
> equivalent.

I've already told you the reason: they made the mistake of wanting to be 
case-independent, and a (bad) case compare is easier in NFD.

Once you give strings semantic meaning (and "case independent" implies 
that semantic meaning), suddenly normalization looks like a good idea, and 
since you're going to corrupt the data *anyway*, who cares? You just 
created a file called "Hello", and readdir() returns "hello" (because there 
was an old file under that name), which is a lot more obviously corrupt 
than anything normalization alone does.
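
A toy sketch of that "Hello"/"hello" effect, assuming Python 3. The class
CaseInsensitiveDir and its methods are hypothetical names made up for
illustration; this is not how HFS+ is implemented, only a model of a
directory that compares names case-insensitively and keeps the first
spelling it saw:

```python
class CaseInsensitiveDir:
    """Toy directory: names compare case-insensitively, first spelling wins."""

    def __init__(self):
        # Maps the case-folded name to the spelling that was stored first.
        self._entries = {}

    def create(self, name):
        # An existing entry with the same folded name is reused as-is,
        # so the caller's spelling is silently dropped.
        self._entries.setdefault(name.casefold(), name)

    def readdir(self):
        return list(self._entries.values())


d = CaseInsensitiveDir()
d.create("hello")    # the old file
d.create("Hello")    # the file you "just created"
print(d.readdir())   # ['hello']
```

The name that comes back is not the name that went in, which is the
visible corruption described above.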

> Sure it does. Normalizing a string produces an equivalent string, and so
> unless I look at the octets the two strings are, for all intents and purposes,
> the same.

.. but you *have* to look at the octets at some point. They're kind of 
what the string is built up of. They never went away, even if you chose to 
ignore them. The encoding is really quite important, and is visible both 
in memory and on disk.

It's what shows up when you sha1sum, but it's also as simple as what shows 
up when you do an "ls -l" and look at a file size.

It doesn't matter if the text is "equivalent", when you then see the 
differences in all these small details.

You can shut your eyes as much as you want, and say that you don't care, 
but the differences are real, and they are visible.
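
A minimal sketch of those visible differences, assuming Python 3 and its
standard unicodedata and hashlib modules. The sample string is made up
for illustration:

```python
import hashlib
import unicodedata

# Hypothetical sample string, chosen only because it has one accented letter.
name = "M\u00e4rchen.txt"

nfc = unicodedata.normalize("NFC", name)   # precomposed: U+00E4
nfd = unicodedata.normalize("NFD", name)   # decomposed: 'a' + U+0308

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Canonically equivalent as text ...
print(unicodedata.normalize("NFC", nfd) == nfc)   # True

# ... but the octets differ, so the sizes and the checksums differ too.
print(len(nfc_bytes), len(nfd_bytes))             # 12 13
print(hashlib.sha1(nfc_bytes).hexdigest())
print(hashlib.sha1(nfd_bytes).hexdigest())
```

The two SHA-1 digests come out different, and the one-byte size
difference is the kind of thing that shows up in a size listing.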

> Decomposing and recomposing shouldn't lose any information we care about -
> when treating filenames as text, a<COMBINING DIAERESIS> and <A WITH DIAERESIS>
> are equivalent, and thus no distinction is made between them. I'm not sure
> what other information you might be considering lost in this case.

You're right, I messed up. I used a non-combining diaeresis, and you're 
right, it doesn't get corrupted. And I think that means that if Apple had 
used NFC, we'd not have this problem with Latin1 systems (because then the 
UTF-8 representation would be the same).
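
A small sketch of both points, assuming Python 3 and its standard
unicodedata module. The variable names are illustrative only. It checks
that a Latin1 "ä" converted to UTF-8 has the same bytes as the NFC form
but not the NFD form, that decomposing and recomposing round-trips, and
that the non-combining (spacing) diaeresis U+00A8 is left alone by
canonical normalization:

```python
import unicodedata

latin1_byte = b"\xe4"                   # 'ä' as a Latin1 system stores it
text = latin1_byte.decode("latin-1")    # U+00E4

utf8_plain = text.encode("utf-8")
utf8_nfc = unicodedata.normalize("NFC", text).encode("utf-8")
utf8_nfd = unicodedata.normalize("NFD", text).encode("utf-8")

print(utf8_plain == utf8_nfc)   # True: both are b'\xc3\xa4'
print(utf8_plain == utf8_nfd)   # False: NFD is b'a\xcc\x88'

# Decompose, then recompose: the original string comes back.
roundtrip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", text))
print(roundtrip == text)        # True

# 'a' followed by the spacing diaeresis is not canonically equivalent to 'ä',
# so neither NFC nor NFD changes it.
spacing = "a\u00a8"
print(unicodedata.normalize("NFC", spacing) == spacing)   # True
print(unicodedata.normalize("NFD", spacing) == spacing)   # True
```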

So I still think that normalization is totally idiotic, but the thing that 
actually causes most problems for people on OS X is that they chose the 
really inconvenient form (NFD).

			Linus