On Jan 21, 2008, at 1:12 PM, Linus Torvalds wrote:
On Mon, 21 Jan 2008, Kevin Ballard wrote:

On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:

I happen to prefer the text-as-string-of-characters (or code points, since you use the other meaning of characters in your posts), since I come from the text world, having worked a lot on Unicode text processing. You apparently prefer the text-as-sequence-of-octets, which I tend to dislike because I would have thought computer engineers would have evolved beyond this when we left the 1900s.

I agree. Every single problem that I can recall Linus bringing up as a consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.
I could say the same thing about you.
The fact is, text-as-string-of-codepoints (let's make the "codepoints" obvious, so that there is no ambiguity, but I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, and totally independent of normalization!) is fine. That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those codepoints, since we create hashes etc, so in fact, as far as git is concerned, names have to be in some particular encoding to be hashed, and UTF-8 is the only sane encoding for Unicode. People can blather about UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is simply technically superior in so many ways that I don't even understand why anybody ever uses anything else. So I would not disagree with using UTF-8 at all. But that is *entirely* a separate issue from "normalization".

Kevin, you seem to think that normalization is somehow forced on you by the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. Normalization is a totally separate decision, and it's a STUPID one, because it breaks so many of the _nice_ properties of using UTF-8.
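[To make the encoding point concrete: the two canonically equivalent spellings of the same text are different codepoint sequences, so they produce different UTF-8 bytes and therefore different SHA-1 hashes. A quick Python illustration of that fact -- not git's actual code:]

```python
import hashlib
import unicodedata

# The same visible text "ä" as two canonically equivalent codepoint sequences:
nfc = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS (precomposed)
nfd = "a\u0308"   # 'a' + COMBINING DIAERESIS (decomposed)

# Different codepoint sequences mean different UTF-8 byte sequences...
print(nfc.encode("utf-8"))   # b'\xc3\xa4'
print(nfd.encode("utf-8"))   # b'a\xcc\x88'

# ...and therefore different SHA-1 hashes, which is why git cares about
# the exact bytes of a name, not just the abstract "text".
print(hashlib.sha1(nfc.encode("utf-8")).hexdigest())
print(hashlib.sha1(nfd.encode("utf-8")).hexdigest())
```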
I'm not saying it's forced on you, I'm saying when you treat filenames as text, it DOESN'T MATTER if the string gets normalized. As long as the string remains equivalent, YOU DON'T CARE about the underlying byte stream.
And THAT is where we differ. It has nothing to do with "octets". It has nothing to do with not liking Unicode. It has nothing to do with "strings".

In short:

- normalization is by no means required or even a good feature. It's something you do when you want to know if two strings are equivalent, but that doesn't actually mean that you should keep the strings normalized all the time!
Alright, fine. I'm not saying HFS+ is right in storing the normalized version, but I do believe the authors of HFS+ must have had a reason to do that, and I also believe that it shouldn't make any difference to me since it remains equivalent.
- normalization has *nothing* to do with "treating text as octets". That's entirely an encoding issue.
Sure it does. Normalizing a string produces an equivalent string, and so unless I look at the octets the two strings are, for all intents and purposes, the same.
- of *course* git has to treat things as a binary stream at some point, since you need that to even compute a SHA1 in the first place, but that has *nothing* to do with normalization or the lack of it.
You're right, but it doesn't have to treat it as a binary stream at the level I care about. I mean, no matter what you do at some level the string is evaluated as a binary stream. For our purposes, just redefine the hashing algorithm to hash all equivalent strings the same, and you can implement that by using SHA1 on a particular encoding of the string.
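[A minimal sketch of what I mean by "hash all equivalent strings the same" -- `equivalence_hash` is a hypothetical helper, not anything git implements, and the choice of NFC here is arbitrary:]

```python
import hashlib
import unicodedata

def equivalence_hash(name: str) -> str:
    """Hash all canonically equivalent strings the same, by normalizing
    to one fixed form (NFC, chosen arbitrarily) before encoding as UTF-8
    and applying SHA-1."""
    canonical = unicodedata.normalize("NFC", name)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Precomposed and decomposed spellings of "ä" now hash identically:
print(equivalence_hash("\u00e4") == equivalence_hash("a\u0308"))  # True
```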
Got it? Forced normalization is stupid, because it changes the data and removes information, and unless you know that change is safe, it's the wrong thing to do.
Decomposing and recomposing shouldn't lose any information we care about - when treating filenames as text, a<COMBINING DIAERESIS> and <A WITH DIAERESIS> are equivalent, and thus no distinction is made between them. I'm not sure what other information you might be considering lost in this case.
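[In Python terms, this equivalence is just canonical normalization round-tripping cleanly -- a sketch using the stdlib unicodedata module:]

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' + COMBINING DIAERESIS

# Decomposing and recomposing round-trips without losing anything
# at the text level: the two forms convert into each other exactly.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
print("canonically equivalent, round-trips cleanly")
```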
One reason _not_ to do normalization is that if you don't, you can still interact with no ambiguity with other non-Unicode locales. You can do the 1:1 Latin1<->Unicode translation, and you *never* get into trouble. In contrast, if you normalize, it's no longer a 1:1 translation any more, and you can get into a situation where the translation from Latin1 to Unicode and back results in a *different* filename than the one you started with!
I don't believe you. See below.
See? That's a *serious* *problem*. A system that forces normalization BY DEFINITION cannot work with people who use a Latin1 filesystem, because it will corrupt the filenames!

But you are apparently too damn stupid to understand that "data corruption" == "bad", and too damn stupid to see that "Unicode" does not mean "Forced normalization".
When have I ever said that Unicode meant Forced normalization?
But I'll try one more time. Let's say that I work on a project where there are some people who use Latin1, and some people who use UTF-8, and we use special characters. It should all work, as long as we use only the common subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting filenames is ok, that doesn't work! It works wonderfully well if you do the obvious 1:1 translation and you do *not* normalize, but the moment you start normalizing, you actually corrupt the filenames!
Wrong.
And yes, the character sequence 'a¨' is exactly one such sequence. It's perfectly representable in both Latin1 and in UTF-8: in Latin1 it is a two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it becomes '\x61\xc2\xa8', and you can convert back and forth between those two forms an infinite amount of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up. Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD) '\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now '\xe4', ie that filename has been corrupted!
Wrong. '\x61\xa8' in Latin1, when converted to UTF-8 (NFD) is still '\x61\xc2\xa8'. You're mixing up DIAERESIS (U+00A8) and COMBINING DIAERESIS (U+0308).
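[The distinction is easy to check -- a sketch in Python, using the stdlib unicodedata module:]

```python
import unicodedata

# Latin1 '\x61\xa8' is 'a' followed by DIAERESIS (U+00A8), a spacing
# character -- NOT COMBINING DIAERESIS (U+0308).
s = b"\x61\xa8".decode("latin-1")

# NFD leaves U+00A8 alone (it has only a *compatibility* decomposition,
# which canonical normalization does not apply):
assert unicodedata.normalize("NFD", s) == s
assert s.encode("utf-8") == b"\x61\xc2\xa8"

# ...so the Latin1 round-trip survives canonical normalization intact:
assert unicodedata.normalize("NFD", s).encode("latin-1") == b"\x61\xa8"

# By contrast, 'a' + COMBINING DIAERESIS really does compose under NFC:
assert unicodedata.normalize("NFC", "a\u0308") == "\xe4"
print("U+00A8 round-trips; only U+0308 composes")
```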
I suspect this is why you've been yelling so much - you have a fundamental misunderstanding about what normalization is actually doing.
See? Normalization in the face of working together with others is a total and utter mistake, and yes, it really *does* corrupt data. It makes it fundamentally impossible to reliably work together with other encodings - even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much be used as a "generic encoding" for just about all locales - if you know the locale you convert from and to, you can generally use UTF-8 as an internal format, knowing that you can always get the same result back in the original encoding. Normalization literally breaks that wonderful generic capability of Unicode.

And the fact that Unicode is such a "generic replacement" for any locale is exactly what makes it so wonderful, and allows you to fairly seamlessly convert piece-meal from some particular locale to Unicode: even if you have some programs that still work in the original locale, you know that you can convert back to it without loss of information.

Except if you normalize. In that case, you *do* lose information, and suddenly one of the best things about Unicode simply disappears.
See above as to why you're not losing the information you so fervently believe you are.
As a result, people who force-normalize are idiots. But they seem to also be stupid enough that they don't understand that they are idiots. Sad.
People who insult others run the risk of looking like a fool when shown to be wrong.
It's a bit like whitespace. Whitespace "doesn't matter" in text (== is equivalent), but an email client that force-normalizes whitespace in text is a really *broken* email client, because it turns out that sometimes even the "equivalent" forms simply do matter. Patches are text, but whitespace is meaningful there.

Same exact deal: it's good to have the *ability* to normalize whitespace (in email, we call this "text=flowed" or similar), and in some cases you might even want to make it the default action, but *forcing* normalization is total idiocy and actually makes the system less useful! ]
Sure, it all depends on what level you need to evaluate text. If we're talking about English paragraphs, then whitespace can be messed with. When we're talking about Unicode strings, then specific encoding can be messed with. When talking about byte sequences, nothing can be messed with.

In our case, when working on an HFS+ filesystem all you have to care about is the Unicode string level. The specific encoding can be messed with, and the client shouldn't care. Problems only arise when attempting to interoperate with filesystems that work at the byte sequence level.

The only information you lose when doing canonical normalization is what the original byte sequence was. Sure, this is a problem when working on a filesystem that cares about byte sequence, but it's not a problem when working on a filesystem that cares about the Unicode string.
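[As a sketch of what "caring about the Unicode string" means in practice -- `same_filename` is a hypothetical helper for illustration, not anything HFS+ or git actually exposes:]

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare filenames at the Unicode string level: canonically
    equivalent spellings count as the same name, whatever their bytes."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Equivalent at the string level, different at the byte level:
print(same_filename("r\u00e9sum\u00e9.txt", "re\u0301sume\u0301.txt"))        # True
print("r\u00e9sum\u00e9.txt".encode() == "re\u0301sume\u0301.txt".encode())   # False
```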
-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com