On Mon, 21 Jan 2008, Kevin Ballard wrote:
>
> On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:
> >
> > I happen to prefer the text-as-string-of-characters (or code points,
> > since you use the other meaning of characters in your posts), since I
> > come from the text world, having worked a lot on Unicode text
> > processing.
> >
> > You apparently prefer the text-as-sequence-of-octets, which I tend to
> > dislike because I would have thought computer engineers would have
> > evolved beyond this when we left the 1900s.
>
> I agree. Every single problem that I can recall Linus bringing up as a
> consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

The fact is, text-as-string-of-codepoints (let's make the "codepoints" explicit, so that there is no ambiguity, and I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, totally independent of normalization!) is fine. That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those codepoints, since we create hashes etc, so in fact, as far as git is concerned, names have to be in some particular encoding to be hashed, and UTF-8 is the only sane encoding for Unicode. People can blather about UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is simply technically superior in so many ways that I don't even understand why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all. But that is *entirely* a separate issue from "normalization". Kevin, you seem to think that normalization is somehow forced on you by the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. Normalization is a totally separate decision, and it's a STUPID one, because it breaks so many of the _nice_ properties of using UTF-8.
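A minimal Python sketch of the distinction (illustrative only, not part of the argument itself): codepoints, encoding, and normalization are three independent decisions.

```python
import unicodedata

# A Unicode string is *defined* as a sequence of codepoints.
s = "a\u0308"                      # 'a' + COMBINING DIAERESIS: 2 codepoints
print([hex(ord(c)) for c in s])    # ['0x61', '0x308']

# Encoding is a separate decision: UTF-8 is one way to serialize
# those codepoints to bytes (the bytes are what git actually hashes).
print(s.encode("utf-8"))           # b'a\xcc\x88'

# Normalization is yet another, independent decision: NFC composes
# the pair into the single precomposed codepoint U+00E4.
print(unicodedata.normalize("NFC", s))        # 'ä'
print(len(unicodedata.normalize("NFC", s)))   # 1 codepoint, not 2
```

Nothing in the definition of a Unicode string forces the third step on you.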
And THAT is where we differ. It has nothing to do with "octets". It has nothing to do with not liking Unicode. It has nothing to do with "strings".

In short:

 - normalization is by no means required, or even a good feature. It's
   something you do when you want to know whether two strings are
   equivalent, but that doesn't actually mean that you should keep the
   strings normalized all the time!

 - normalization has *nothing* to do with "treating text as octets".
   That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point,
   since you need that to even compute a SHA1 in the first place, but
   that has *nothing* to do with normalization or the lack of it.

Got it? Forced normalization is stupid, because it changes the data and removes information, and unless you know that change is safe, it's the wrong thing to do.

One reason _not_ to do normalization is that if you don't, you can still interact with no ambiguity with other non-Unicode locales. You can do the 1:1 Latin1<->Unicode translation, and you *never* get into trouble. In contrast, if you normalize, it's no longer a 1:1 translation any more, and you can get into a situation where the translation from Latin1 to Unicode and back results in a *different* filename than the one you started with!

See? That's a *serious* problem. A system that forces normalization BY DEFINITION cannot work with people who use a Latin1 filesystem, because it will corrupt the filenames! But you are apparently too damn stupid to understand that "data corruption" == "bad", and too damn stupid to see that "Unicode" does not mean "forced normalization".

But I'll try one more time. Let's say that I work on a project where some people use Latin1, and some people use UTF-8, and we use special characters. It should all work, as long as we use only the common subset, and we teach git to convert to UTF-8 as a common base. Right?
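The round-trip claim, and the way forced normalization breaks it, can be checked in a few lines of Python (a sketch, using NFD here since that is the form HFS+ forces):

```python
import unicodedata

# Without normalization, Latin1 <-> Unicode is a perfect 1:1 mapping:
# every one of the 256 Latin1 bytes decodes to exactly one codepoint
# and encodes back to the very same byte.
for b in range(256):
    assert bytes([b]).decode("latin-1").encode("latin-1") == bytes([b])

# Add forced NFD normalization and the round trip dies.
name = b"\xe4".decode("latin-1")             # Latin1 'ä' -> U+00E4
stored = unicodedata.normalize("NFD", name)  # decomposed: 'a' + U+0308
print(stored.encode("utf-8"))                # b'a\xcc\x88'

# U+0308 (combining diaeresis) has no Latin1 byte, so the stored name
# can never be converted back to the '\xe4' we started with.
try:
    stored.encode("latin-1")
except UnicodeEncodeError:
    print("filename corrupted: no lossless way back to Latin1")
```

The loop is the 1:1 translation that "never gets into trouble"; the normalized case is the corruption.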
In your *idiotic* world, where you have to normalize and corrupting filenames is ok, that doesn't work!

It works wonderfully well if you do the obvious 1:1 translation and you do *not* normalize, but the moment you start normalizing, you actually corrupt the filenames!

And yes, the character 'ä' is exactly one such case. It's perfectly representable in both Latin1 and in UTF-8: in Latin1 it is the single byte '\xe4', a Latin1->UTF-8 conversion turns it into the precomposed '\xc3\xa4' (U+00E4), and you can convert back and forth between those two forms an infinite number of times, and you never corrupt it. But the moment you add normalization to the mix, you start screwing up. Under NFD (which is exactly what HFS+ forces), '\xc3\xa4' is decomposed into '\x61\xcc\x88', an 'a' followed by the combining diaeresis U+0308, and since the combining diaeresis has no Latin1 byte at all, that name can no longer be converted back to the '\xe4' you started with, ie the filename has been corrupted!

See? Normalization in the face of working together with others is a total and utter mistake, and yes, it really *does* corrupt data. It makes it fundamentally impossible to reliably work together with other encodings - even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much be used as a "generic encoding" for just about all locales: if you know the locale you convert from and to, you can generally use UTF-8 as an internal format, knowing that you can always get the same result back in the original encoding. Normalization literally breaks that wonderful generic capability of Unicode.

And the fact that Unicode is such a "generic replacement" for any locale is exactly what makes it so wonderful, and allows you to fairly seamlessly convert piecemeal from some particular locale to Unicode: even if you have some programs that still work in the original locale, you know that you can convert back to it without loss of information.

Except if you normalize.
In that case, you *do* lose information, and suddenly one of the best things about Unicode simply disappears.

As a result, people who force-normalize are idiots. But they seem to also be stupid enough that they don't understand that they are idiots. Sad.

It's a bit like whitespace. Whitespace "doesn't matter" in text (== is equivalent), but an email client that force-normalizes whitespace in text is a really *broken* email client, because it turns out that sometimes even the "equivalent" forms simply do matter. Patches are text, but whitespace is meaningful there.

Same exact deal: it's good to have the *ability* to normalize whitespace (in email, we call this "format=flowed" or similar), and in some cases you might even want to make it the default action, but *forcing* normalization is total idiocy and actually makes the system less useful! ]

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html