On Jan 21, 2008, at 1:12 PM, Linus Torvalds wrote:
On Mon, 21 Jan 2008, Kevin Ballard wrote:

On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:

I happen to prefer the text-as-string-of-characters (or code points, since you use the other meaning of characters in your posts), since I come from the text world, having worked a lot on Unicode text processing. You apparently prefer the text-as-sequence-of-octets, which I tend to dislike because I would have thought computer engineers would have evolved beyond this when we left the 1900s.

I agree. Every single problem that I can recall Linus bringing up as a consequence of HFS+ treating filenames as strings [..]

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.
I could say the same thing about you.
The fact is, text-as-string-of-codepoints (let's make the "codepoints" obvious, so that there is no ambiguity, but I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, and totally independent of normalization!) is fine. That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those codepoints, since we create hashes etc, so in fact, as far as git is concerned, names have to be in some particular encoding to be hashed, and UTF-8 is the only sane encoding for Unicode. People can blather about UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is simply technically superior in so many ways that I don't even understand why anybody ever uses anything else. So I would not disagree with using UTF-8 at all. But that is *entirely* a separate issue from "normalization".

Kevin, you seem to think that normalization is somehow forced on you by the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. Normalization is a totally separate decision, and it's a STUPID one, because it breaks so many of the _nice_ properties of using UTF-8.
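[To make the encoding point concrete: the two canonically equivalent spellings of the same text are different codepoint sequences, so they produce different UTF-8 bytes and therefore different SHA-1 hashes. A quick Python illustration of that fact -- not git's actual code:]

```python
import hashlib
import unicodedata

# The same visible text "ä" as two canonically equivalent codepoint sequences:
nfc = "\u00e4"    # LATIN SMALL LETTER A WITH DIAERESIS (precomposed)
nfd = "a\u0308"   # 'a' + COMBINING DIAERESIS (decomposed)

# Different codepoint sequences mean different UTF-8 byte sequences...
print(nfc.encode("utf-8"))   # b'\xc3\xa4'
print(nfd.encode("utf-8"))   # b'a\xcc\x88'

# ...and therefore different SHA-1 hashes, which is why git cares about
# the exact bytes of a name, not just the abstract "text".
print(hashlib.sha1(nfc.encode("utf-8")).hexdigest())
print(hashlib.sha1(nfd.encode("utf-8")).hexdigest())
```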
I'm not saying it's forced on you, I'm saying when you treat filenames as text, it DOESN'T MATTER if the string gets normalized. As long as the string remains equivalent, YOU DON'T CARE about the underlying byte stream.
And THAT is where we differ. It has nothing to do with "octets". It has nothing to do with not liking Unicode. It has nothing to do with "strings".

In short:

- normalization is by no means required or even a good feature. It's something you do when you want to know if two strings are equivalent, but that doesn't actually mean that you should keep the strings normalized all the time!
Alright, fine. I'm not saying HFS+ is right in storing the normalized version, but I do believe the authors of HFS+ must have had a reason to do that, and I also believe that it shouldn't make any difference to me since it remains equivalent.
- normalization has *nothing* to do with "treating text as octets". That's entirely an encoding issue.
Sure it does. Normalizing a string produces an equivalent string, and so unless I look at the octets the two strings are, for all intents and purposes, the same.
- of *course* git has to treat things as a binary stream at some point, since you need that to even compute a SHA1 in the first place, but that has *nothing* to do with normalization or the lack of it.
You're right, but it doesn't have to treat it as a binary stream at the level I care about. I mean, no matter what you do at some level the string is evaluated as a binary stream. For our purposes, just redefine the hashing algorithm to hash all equivalent strings the same, and you can implement that by using SHA1 on a particular encoding of the string.
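[A minimal sketch of what I mean by "hash all equivalent strings the same" -- `equivalence_hash` is a hypothetical helper, not anything git implements, and the choice of NFC here is arbitrary:]

```python
import hashlib
import unicodedata

def equivalence_hash(name: str) -> str:
    """Hash all canonically equivalent strings the same, by normalizing
    to one fixed form (NFC, chosen arbitrarily) before encoding as UTF-8
    and applying SHA-1."""
    canonical = unicodedata.normalize("NFC", name)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Precomposed and decomposed spellings of "ä" now hash identically:
print(equivalence_hash("\u00e4") == equivalence_hash("a\u0308"))  # True
```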
Got it? Forced normalization is stupid, because it changes the data and removes information, and unless you know that change is safe, it's the wrong thing to do.
Decomposing and recomposing shouldn't lose any information we care about - when treating filenames as text, a<COMBINING DIAERESIS> and <A WITH DIAERESIS> are equivalent, and thus no distinction is made between them. I'm not sure what other information you might be considering lost in this case.
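[In Python terms, this equivalence is just canonical normalization round-tripping cleanly -- a sketch using the stdlib unicodedata module:]

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' + COMBINING DIAERESIS

# Decomposing and recomposing round-trips without losing anything
# at the text level: the two forms convert into each other exactly.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
print("canonically equivalent, round-trips cleanly")
```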
One reason _not_ to do normalization is that if you don't, you can still interact with no ambiguity with other non-Unicode locales. You can do the 1:1 Latin1<->Unicode translation, and you *never* get into trouble. In contrast, if you normalize, it's no longer a 1:1 translation any more, and you can get into a situation where the translation from Latin1 to Unicode and back results in a *different* filename than the one you started with!
I don't believe you. See below.
See? That's a *serious* *problem*. A system that forces normalization BY DEFINITION cannot work with people who use a Latin1 filesystem, because it will corrupt the filenames!

But you are apparently too damn stupid to understand that "data corruption" == "bad", and too damn stupid to see that "Unicode" does not mean "Forced normalization".
When have I ever said that Unicode meant Forced normalization?
But I'll try one more time. Let's say that I work on a project where there are some people who use Latin1, and some people who use UTF-8, and we use special characters. It should all work, as long as we use only the common subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting filenames is ok, that doesn't work! It works wonderfully well if you do the obvious 1:1 translation and you do *not* normalize, but the moment you start normalizing, you actually corrupt the filenames!
Wrong.
And yes, the character sequence 'a¨' is exactly one such sequence. It's perfectly representable in both Latin1 and in UTF-8: in Latin1 it is a two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it becomes '\x61\xc2\xa8', and you can convert back and forth between those two forms an infinite amount of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up. Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD) '\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now '\xe4', ie that filename has been corrupted!
Wrong. '\x61\xa8' in Latin1, when converted to UTF-8 (NFD) is still '\x61\xc2\xa8'. You're mixing up DIAERESIS (U+00A8) and COMBINING DIAERESIS (U+0308).
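[The distinction is easy to check -- a sketch in Python, using the stdlib unicodedata module:]

```python
import unicodedata

# Latin1 '\x61\xa8' is 'a' followed by DIAERESIS (U+00A8), a spacing
# character -- NOT COMBINING DIAERESIS (U+0308).
s = b"\x61\xa8".decode("latin-1")

# NFD leaves U+00A8 alone (it has only a *compatibility* decomposition,
# which canonical normalization does not apply):
assert unicodedata.normalize("NFD", s) == s
assert s.encode("utf-8") == b"\x61\xc2\xa8"

# ...so the Latin1 round-trip survives canonical normalization intact:
assert unicodedata.normalize("NFD", s).encode("latin-1") == b"\x61\xa8"

# By contrast, 'a' + COMBINING DIAERESIS really does compose under NFC:
assert unicodedata.normalize("NFC", "a\u0308") == "\xe4"
print("U+00A8 round-trips; only U+0308 composes")
```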
I suspect this is why you've been yelling so much - you have a fundamental misunderstanding about what normalization is actually doing.
See? Normalization in the face of working together with others is a total and utter mistake, and yes, it really *does* corrupt data. It makes it fundamentally impossible to reliably work together with other encodings - even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much be used as a "generic encoding" for just about all locales - if you know the locale you convert from and to, you can generally use UTF-8 as an internal format, knowing that you can always get the same result back in the original encoding. Normalization literally breaks that wonderful generic capability of Unicode.

And the fact that Unicode is such a "generic replacement" for any locale is exactly what makes it so wonderful, and allows you to fairly seamlessly convert piece-meal from some particular locale to Unicode: even if you have some programs that still work in the original locale, you know that you can convert back to it without loss of information.

Except if you normalize. In that case, you *do* lose information, and suddenly one of the best things about Unicode simply disappears.
See above as to why you're not losing the information you so fervently believe you are.
As a result, people who force-normalize are idiots. But they seem to also be stupid enough that they don't understand that they are idiots. Sad.
People who insult others run the risk of looking like a fool when shown to be wrong.
It's a bit like whitespace. Whitespace "doesn't matter" in text (== is equivalent), but an email client that force-normalizes whitespace in text is a really *broken* email client, because it turns out that sometimes even the "equivalent" forms simply do matter. Patches are text, but whitespace is meaningful there.

Same exact deal: it's good to have the *ability* to normalize whitespace (in email, we call this "text=flowed" or similar), and in some cases you might even want to make it the default action, but *forcing* normalization is total idiocy and actually makes the system less useful! ]
Sure, it all depends on what level you need to evaluate text. If we're talking about English paragraphs, then whitespace can be messed with. When we're talking about Unicode strings, then specific encoding can be messed with. When talking about byte sequences, nothing can be messed with.

In our case, when working on an HFS+ filesystem all you have to care about is the Unicode string level. The specific encoding can be messed with, and the client shouldn't care. Problems only arise when attempting to interoperate with filesystems that work at the byte sequence level.

The only information you lose when doing canonical normalization is what the original byte sequence was. Sure, this is a problem when working on a filesystem that cares about byte sequence, but it's not a problem when working on a filesystem that cares about the Unicode string.
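[As a sketch of what "caring about the Unicode string" means in practice -- `same_filename` is a hypothetical helper for illustration, not anything HFS+ or git actually exposes:]

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare filenames at the Unicode string level: canonically
    equivalent spellings count as the same name, whatever their bytes."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Equivalent at the string level, different at the byte level:
print(same_filename("r\u00e9sum\u00e9.txt", "re\u0301sume\u0301.txt"))        # True
print("r\u00e9sum\u00e9.txt".encode() == "re\u0301sume\u0301.txt".encode())   # False
```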
-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com