Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 1:12 PM, Linus Torvalds wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:
On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:

I happen to prefer the text-as-string-of-characters (or code points,
since you use the other meaning of characters in your posts), since I
come from the text world, having worked a lot on Unicode text
processing.

You apparently prefer the text-as-sequence-of-octets, which I tend to
dislike because I would have thought computer engineers would have
evolved beyond this when we left the 1900s.

I agree. Every single problem that I can recall Linus bringing up as a
consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

I could say the same thing about you.

The fact is, text-as-string-of-codepoints (let's make the "codepoints"
obvious, so that there is no ambiguity, but I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, and totally
independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those
codepoints, since we create hashes etc., so in fact, as far as git is
concerned, names have to be in some particular encoding to be hashed, and
UTF-8 is the only sane encoding for Unicode. People can blather about
UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
simply technically superior in so many ways that I don't even understand
why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all.

But that is *entirely* a separate issue from "normalization".

Kevin, you seem to think that normalization is somehow forced on you by
the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
Normalization is a totally separate decision, and it's a STUPID one,
because it breaks so many of the _nice_ properties of using UTF-8.

I'm not saying it's forced on you, I'm saying when you treat filenames as text, it DOESN'T MATTER if the string gets normalized. As long as the string remains equivalent, YOU DON'T CARE about the underlying byte stream.
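
To put that in code (a quick Python sketch of the comparison I have in mind; the helper name is made up, this isn't anything git currently does):

    import unicodedata

    def same_text(a, b):
        # Compare filenames at the text level: canonical normalization
        # makes equivalent codepoint sequences identical before comparing,
        # regardless of which form the filesystem handed back.
        return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

    print(same_text('a\u0308', '\u00e4'))   # True  - equivalent as text
    print('a\u0308' == '\u00e4')            # False - different codepoints/bytes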

And THAT is where we differ. It has nothing to do with "octets". It has
nothing to do with not liking Unicode. It has nothing to do with
"strings".

In short:

- normalization is by no means required or even a good feature. It's
  something you do when you want to know if two strings are equivalent,
  but that doesn't actually mean that you should keep the strings
  normalized all the time!

Alright, fine. I'm not saying HFS+ is right in storing the normalized version, but I do believe the authors of HFS+ must have had a reason to do that, and I also believe that it shouldn't make any difference to me since it remains equivalent.

- normalization has *nothing* to do with "treating text as octets".
  That's entirely an encoding issue.

Sure it does. Normalizing a string produces an equivalent string, and so unless I look at the octets the two strings are, for all intents and purposes, the same.

- of *course* git has to treat things as a binary stream at some point, since you need that to even compute a SHA1 in the first place, but that
  has *nothing* to do with normalization or the lack of it.

You're right, but it doesn't have to treat it as a binary stream at the level I care about. I mean, no matter what you do at some level the string is evaluated as a binary stream. For our purposes, just redefine the hashing algorithm to hash all equivalent strings the same, and you can implement that by using SHA1 on a particular encoding of the string.
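
Something along these lines (a sketch of the idea in Python, not a proposed patch; the function name is mine):

    import hashlib
    import unicodedata

    def name_hash(name):
        # Pick one canonical form (NFC here) and one encoding (UTF-8),
        # and hash that, so every canonically equivalent spelling of the
        # same filename produces the same SHA1.
        canonical = unicodedata.normalize('NFC', name)
        return hashlib.sha1(canonical.encode('utf-8')).hexdigest()

    assert name_hash('a\u0308') == name_hash('\u00e4')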

Got it? Forced normalization is stupid, because it changes the data and
removes information, and unless you know that change is safe, it's the
wrong thing to do.

Decomposing and recomposing shouldn't lose any information we care about - when treating filenames as text, a<COMBINING DIAERESIS> and <A WITH DIAERESIS> are equivalent, and thus no distinction is made between them. I'm not sure what other information you might be considering lost in this case.
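
For the curious, here's what the Unicode character database says about those two spellings (plain Python, nothing git-specific):

    import unicodedata

    decomposed  = 'a\u0308'   # LATIN SMALL LETTER A + COMBINING DIAERESIS
    precomposed = '\u00e4'    # LATIN SMALL LETTER A WITH DIAERESIS

    for s in (decomposed, precomposed):
        print([hex(ord(c)) for c in s],
              '-> NFC', [hex(ord(c)) for c in unicodedata.normalize('NFC', s)],
              '-> NFD', [hex(ord(c)) for c in unicodedata.normalize('NFD', s)])

    # Both spellings normalize to the same NFC form (0xe4) and the same
    # NFD form (0x61, 0x308); only the stored codepoint/byte sequence differs.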

One reason _not_ to do normalization is that if you don't, you can still interact with no ambiguity with other non-Unicode locales. You can do the
1:1 Latin1<->Unicode translation, and you *never* get into trouble. In
contrast, if you normalize, it's no longer a 1:1 translation any more, and you can get into a situation where the translation from Latin1 to Unicode and back results in a *different* filename than the one you started with!

I don't believe you. See below.

See? That's a *serious* *problem*. A system that forces normalization BY
DEFINITION cannot work with people who use a Latin1 filesystem, because it
will corrupt the filenames!

But you are apparently too damn stupid to understand that "data
corruption" == "bad", and too damn stupid to see that "Unicode" does not
mean "Forced normalization".

When have I ever said that Unicode meant Forced normalization?

But I'll try one more time. Let's say that I work on a project where there are some people who use Latin1, and some people who use UTF-8, and we use special characters. It should all work, as long as we use only the common
subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting
filenames is ok, that doesn't work! It works wonderfully well if you do the obvious 1:1 translation and you do *not* normalize, but the moment you
start normalizing, you actually corrupt the filenames!

Wrong.

And yes, the character sequence 'a¨' is exactly one such sequence. It's
perfectly representable in both Latin1 and in UTF-8: in latin1 it is a
two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it becomes '\x61\xc2\xa8', and you can convert back and forth between those
two forms an infinite amount of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up.
Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD)
'\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now '\xe4',
ie that filename has been corrupted!

Wrong. '\x61\xa8' in Latin1, when converted to UTF-8 (NFD) is still '\x61\xc2\xa8'. You're mixing up DIAERESIS (U+00A8) and COMBINING DIAERESIS (U+0308).

I suspect this is why you've been yelling so much - you have a fundamental misunderstanding about what normalization is actually doing.
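
You can check it in a couple of lines of Python (just the Unicode data, nothing HFS+-specific):

    import unicodedata

    latin1_name = b'\x61\xa8'                    # 'a' + DIAERESIS in Latin1
    as_text     = latin1_name.decode('latin-1')  # U+0061 U+00A8
    nfd         = unicodedata.normalize('NFD', as_text)

    print(nfd == as_text)                        # True: U+00A8 has no canonical
                                                 # decomposition (only NFKD touches it)
    print(nfd.encode('utf-8'))                   # b'a\xc2\xa8'
    print(nfd.encode('latin-1') == latin1_name)  # True: the round-trip is intact

    # The character you're thinking of is U+00E4, which NFD *does* decompose:
    print([hex(ord(c)) for c in unicodedata.normalize('NFD', '\xe4')])  # ['0x61', '0x308']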

See? Normalization in the face of working together with others is a total
and utter mistake, and yes, it really *does* corrupt data. It makes it
fundamentally impossible to reliably work together with other encodings -
even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much be used as a "generic encoding" for just about all locales - if you know
 the locale you convert from and to, you can generally use UTF-8 as an
internal format, knowing that you can always get the same result back in
 the original encoding. Normalization literally breaks that wonderful
 generic capability of Unicode.

And the fact that Unicode is such a "generic replacement" for any locale
 is exactly what makes it so wonderful, and allows you to fairly
 seamlessly convert piece-meal from some particular locale to Unicode:
even if you have some programs that still work in the original locale,
 you know that you can convert back to it without loss of information.

 Except if you normalize. In that case, you *do* lose information, and
 suddenly one of the best things about Unicode simply disappears.

See above as to why you're not losing the information you so fervently believe you are.

 As a result, people who force-normalize are idiots. But they seem to
also be stupid enough that they don't understand that they are idiots.
 Sad.

People who insult others run the risk of looking like fools when shown to be wrong.

It's a bit like whitespace. Whitespace "doesn't matter" in text (== is
 equivalent), but an email client that force-normalizes whitespace in
 text is a really *broken* email client, because it turns out that
 sometimes even the "equivalent" forms simply do matter. Patches are
 text, but whitespace is meaningful there.

 Same exact deal: it's good to have the *ability* to normalize
 whitespace (in email, we call this "format=flowed" or similar), and in
 some cases you might even want to make it the default action, but
 *forcing* normalization is total idiocy and actually makes the system
 less useful! ]

Sure, it all depends on what level you need to evaluate text. If we're talking about English paragraphs, then whitespace can be messed with. When we're talking about Unicode strings, then the specific encoding can be messed with. When talking about byte sequences, nothing can be messed with.

In our case, when working on an HFS+ filesystem all you have to care about is the Unicode string level. The specific encoding can be messed with, and the client shouldn't care. Problems only arise when attempting to interoperate with filesystems that work at the byte sequence level.

The only information you lose when doing canonical normalization is what the original byte sequence was. Sure, this is a problem when working on a filesystem that cares about byte sequence, but it's not a problem when working on a filesystem that cares about the Unicode string.
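
As a sketch of what "caring about the Unicode string" means in practice (hypothetical helper, not a description of git's actual index code):

    import os
    import unicodedata

    def match_tracked(tracked_names, directory='.'):
        # Hypothetical sketch: compare what the filesystem reports against
        # the names we have recorded at the Unicode-string level, so an
        # HFS+-style decomposed directory listing still matches names that
        # were recorded in precomposed form (or vice versa).
        wanted = {unicodedata.normalize('NFC', n): n for n in tracked_names}
        for entry in os.listdir(directory):
            key = unicodedata.normalize('NFC', entry)
            if key in wanted:
                yield wanted[key], entry   # (name as recorded, name on disk)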

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



