On Mon, 21 Jan 2008, Kevin Ballard wrote:
>
> > It's what shows up when you sha1sum, but it's also as simple as what
> > shows up when you do an "ls -l" and look at a file size.
>
> It does? Why on earth should it do that? Filename doesn't contribute to
> the listed filesize on OS X.

Umm. What's this inability to see that data is data is data?

Why do you think Unicode has anything in particular to do with filenames?
Those same unicode strings are often part of the file data itself, and
then that encoding damn well is visible in "ls -l".

Doing

    echo ä > file
    ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much.

My point was that those underlying octets are always there, and they do
matter. The fact that the differences may not be visible when you compare
the normalized forms doesn't make it any less true.

You can choose to put blinders on and try to claim that normalization is
invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE
IT. But that doesn't change the fact that a lot of things *do* see it.

There are very few things that are "Unicode specific", and a *lot* of
tools that are just "general data tools". And git tries to be a general
data tool, not a Unicode-specific one.

> I'm not sure what you mean. The byte sequence is different from Latin1
> to UTF-8 even if you use NFC, so I don't think, in this case, it makes
> any difference whether you use NFC or NFD.
>
> Yes, the codepoints are the same in Latin1 and UTF-8 if you use NFC, but
> that's hardly relevant. Please correct me if I'm wrong, but I believe
> Latin1->UTF-8->Latin1 conversion will always produce the same Latin1
> text whether you use NFC or NFD.

The problem is that the UTF-8 form is different, so if you save things in
UTF-8 (which we hopefully agree is a sane thing to do), then you should
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing
to something else, you actually de-normalize as far as those other people
are concerned. So if you have to normalize, at least use the normal form!

> The only reason it's particularly inconvenient is because it's different
> from what most other systems picked. And if you want to blame someone
> for that, blame Unicode for having so many different normalization
> forms.

I blame them for encouraging normalization at all. It's stupid. You don't
need it.

The people who care about "are these strings equivalent" shouldn't do a
"memcmp()" on them in the first place. And if you don't do a memcmp() on
things, then you don't need to normalize.

So you have two cases:

 (a) the cases that care about *identity*. They don't want normalization.

 (b) the cases that care about *equivalence*. And they shouldn't do
     octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither
case is normalization the right thing to do (except as *possibly* an
internal part of the comparison, but there are actually better ways to
check for equivalence than the brute-force "normalize both and compare
results bitwise").

		Linus
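
[Editor's note: a minimal sketch of the "ls -l" point above, in Python 3
with only the standard library; the variable names are illustrative and
not anything from the thread. The same visible "ä" stored precomposed
(NFC) or decomposed (NFD) is a different octet sequence, so any byte-level
tool -- file sizes, sha1sum-style hashes -- sees two different things.]

    import hashlib
    import unicodedata

    nfc = unicodedata.normalize("NFC", "a\u0308")  # U+00E4: 0xC3 0xA4 in UTF-8
    nfd = unicodedata.normalize("NFD", "\u00e4")   # U+0061 U+0308: 0x61 0xCC 0x88

    print(len(nfc.encode("utf-8")))   # 2 -- the size "ls -l" would report
    print(len(nfd.encode("utf-8")))   # 3 -- one byte more for the same visible text
    print(hashlib.sha1(nfc.encode("utf-8")).hexdigest() ==
          hashlib.sha1(nfd.encode("utf-8")).hexdigest())   # False -- different hashes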
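
[Editor's note: the Latin1 round-trip exchange above can be illustrated the
same way -- again a Python 3 sketch using its built-in codecs and
unicodedata, meant only to show the encoding arithmetic. The NFC form
shares its code point with Latin1 and converts straight back; the NFD form
contains U+0308, which has no Latin1 byte at all, so the round trip only
works if something recomposes it first, which is the sense in which the two
UTF-8 forms are not interchangeable.]

    import unicodedata

    latin1 = b"\xe4"                               # "ä" as the single Latin1 byte 0xE4
    as_nfc = latin1.decode("latin-1")              # U+00E4 -- same code point as Latin1
    as_nfd = unicodedata.normalize("NFD", as_nfc)  # U+0061 U+0308

    print(as_nfc.encode("latin-1"))                # b'\xe4' -- round-trips directly
    try:
        as_nfd.encode("latin-1")                   # U+0308 has no Latin1 equivalent
    except UnicodeEncodeError:
        print("NFD form does not map back without recomposition")

    print(unicodedata.normalize("NFC", as_nfd).encode("latin-1"))   # b'\xe4' again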
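
[Editor's note: one last sketch of the (a)/(b) split; the helper names
same_identity and equivalent are made up for illustration and are not
anything git uses. The point it tries to capture: the identity check looks
at raw octets, the equivalence check may normalize purely as an internal
step of the comparison (the brute-force way shown here; smarter comparisons
can walk combining sequences directly), and in neither case does the stored
data get rewritten.]

    import unicodedata

    def same_identity(a: bytes, b: bytes) -> bool:
        # (a) identity: the stored octets either match or they don't.
        return a == b

    def equivalent(a: bytes, b: bytes) -> bool:
        # (b) equivalence: normalize both sides internally and compare.
        # Brute-force version; the data itself is never touched.
        return (unicodedata.normalize("NFC", a.decode("utf-8")) ==
                unicodedata.normalize("NFC", b.decode("utf-8")))

    nfc_bytes = "\u00e4".encode("utf-8")        # 0xC3 0xA4
    nfd_bytes = "a\u0308".encode("utf-8")       # 0x61 0xCC 0x88
    print(same_identity(nfc_bytes, nfd_bytes))  # False -- different octets
    print(equivalent(nfc_bytes, nfd_bytes))     # True  -- canonically equivalent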