On Mon, 21 Jan 2008, Kevin Ballard wrote:
>
> > It's what shows up when you sha1sum, but it's also as simple as what
> > shows up when you do an "ls -l" and look at a file size.
>
> It does? Why on earth should it do that? Filename doesn't contribute to
> the listed filesize on OS X.

Umm. What's this inability to see that data is data is data?

Why do you think Unicode has anything in particular to do with filenames?
Those same unicode strings are often part of the file data itself, and
then that encoding damn well is visible in "ls -l".

Doing

    echo ä > file
    ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much.

My point was that those underlying octets are always there, and they do
matter. The fact that the differences may not be visible when you compare
the normalized forms doesn't make it any less true.

You can choose to put blinders on and try to claim that normalization is
invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE
IT. But that doesn't change the fact that a lot of things *do* see it.

There are very few things that are "Unicode specific", and a *lot* of
tools that are just "general data tools". And git tries to be a general
data tool, not a Unicode-specific one.

> I'm not sure what you mean. The byte sequence is different from Latin1
> to UTF-8 even if you use NFC, so I don't think, in this case, it makes
> any difference whether you use NFC or NFD.
>
> Yes, the codepoints are the same in Latin1 and UTF-8 if you use NFC, but
> that's hardly relevant. Please correct me if I'm wrong, but I believe
> Latin1->UTF-8->Latin1 conversion will always produce the same Latin1
> text whether you use NFC or NFD.

The problem is that the UTF-8 form is different, so if you save things in
UTF-8 (which we hopefully agree is a sane thing to do), then you should
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing
to something else, you actually de-normalize as far as those other people
are concerned. So if you have to normalize, at least use the normal form!

> The only reason it's particularly inconvenient is because it's different
> from what most other systems picked. And if you want to blame someone
> for that, blame Unicode for having so many different normalization
> forms.

I blame them for encouraging normalization at all. It's stupid. You don't
need it.

The people who care about "are these strings equivalent" shouldn't do a
"memcmp()" on them in the first place. And if you don't do a memcmp() on
things, then you don't need to normalize.

So you have two cases:

 (a) the cases that care about *identity*. They don't want normalization.

 (b) the cases that care about *equivalence*. And they shouldn't do
     octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither
case is normalization the right thing to do (except as *possibly* an
internal part of the comparison, but there are actually better ways to
check for equivalence than the brute-force "normalize both and compare
results bitwise").

		Linus
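
[Editor's note: a minimal sketch of the "ls -l" point above, in Python 3
with only the standard library; the variable names are illustrative and
not anything from the thread. The same visible "ä" stored precomposed
(NFC) or decomposed (NFD) is a different octet sequence, so any byte-level
tool -- file sizes, sha1sum-style hashes -- sees two different things.]

    import hashlib
    import unicodedata

    nfc = unicodedata.normalize("NFC", "a\u0308")  # U+00E4: 0xC3 0xA4 in UTF-8
    nfd = unicodedata.normalize("NFD", "\u00e4")   # U+0061 U+0308: 0x61 0xCC 0x88

    print(len(nfc.encode("utf-8")))   # 2 -- the size "ls -l" would report
    print(len(nfd.encode("utf-8")))   # 3 -- one byte more for the same visible text
    print(hashlib.sha1(nfc.encode("utf-8")).hexdigest() ==
          hashlib.sha1(nfd.encode("utf-8")).hexdigest())   # False -- different hashes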
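
[Editor's note: the Latin1 round-trip exchange above can be illustrated the
same way -- again a Python 3 sketch using its built-in codecs and
unicodedata, meant only to show the encoding arithmetic. The NFC form
shares its code point with Latin1 and converts straight back; the NFD form
contains U+0308, which has no Latin1 byte at all, so the round trip only
works if something recomposes it first, which is the sense in which the two
UTF-8 forms are not interchangeable.]

    import unicodedata

    latin1 = b"\xe4"                               # "ä" as the single Latin1 byte 0xE4
    as_nfc = latin1.decode("latin-1")              # U+00E4 -- same code point as Latin1
    as_nfd = unicodedata.normalize("NFD", as_nfc)  # U+0061 U+0308

    print(as_nfc.encode("latin-1"))                # b'\xe4' -- round-trips directly
    try:
        as_nfd.encode("latin-1")                   # U+0308 has no Latin1 equivalent
    except UnicodeEncodeError:
        print("NFD form does not map back without recomposition")

    print(unicodedata.normalize("NFC", as_nfd).encode("latin-1"))   # b'\xe4' again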
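
[Editor's note: one last sketch of the (a)/(b) split; the helper names
same_identity and equivalent are made up for illustration and are not
anything git uses. The point it tries to capture: the identity check looks
at raw octets, the equivalence check may normalize purely as an internal
step of the comparison (the brute-force way shown here; smarter comparisons
can walk combining sequences directly), and in neither case does the stored
data get rewritten.]

    import unicodedata

    def same_identity(a: bytes, b: bytes) -> bool:
        # (a) identity: the stored octets either match or they don't.
        return a == b

    def equivalent(a: bytes, b: bytes) -> bool:
        # (b) equivalence: normalize both sides internally and compare.
        # Brute-force version; the data itself is never touched.
        return (unicodedata.normalize("NFC", a.decode("utf-8")) ==
                unicodedata.normalize("NFC", b.decode("utf-8")))

    nfc_bytes = "\u00e4".encode("utf-8")        # 0xC3 0xA4
    nfd_bytes = "a\u0308".encode("utf-8")       # 0x61 0xCC 0x88
    print(same_identity(nfc_bytes, nfd_bytes))  # False -- different octets
    print(equivalent(nfc_bytes, nfd_bytes))     # True  -- canonically equivalent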