Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:

It's what shows up when you sha1sum, but it's also as simple as what shows
up when you do an "ls -l" and look at a file size.

It does? Why on earth should it do that? The filename doesn't contribute to
the listed filesize on OS X.

Umm. What's this inability to see that data is data is data?

I'm not sure what you mean. I stated a fact - at least on OS X, the filename does not contribute to the listed filesize, so changing the encoding of the filename doesn't change the filesize. This isn't a philosophical point, it's a factual statement.
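
To make that concrete (throwaway filenames of my own choosing; nothing OS X-specific about this):

	printf 'hello\n' > short
	printf 'hello\n' > a-considerably-longer-filename
	ls -l short a-considerably-longer-filename   # both report 6 bytes; the name's length never enters into it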

Why do you think Unicode has anything in particular to do with filenames?

I don't, but I do think this discussion revolves around filenames, therefore it should not surprise you when I talk about filenames.

Those same unicode strings are often part of the file data itself, and
then that encoding damn well is visible in "ls -l".

Doing

	echo ä > file
	ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much. My point was that those underlying octets are always there, and they do matter. The fact that the differences may not be visible when you compare
the normalized forms doesn't make it any less true.

Yes, I am well aware that the encoding of the *file contents* affects filesize. But when did I suggest changing the encoding of filenames that happen to appear inside file contents? If you treat filenames as strings, there's no requirement to touch those occurrences at all. I'm talking specifically about the filenames themselves, not about file contents, so stop arguing against something irrelevant.

You can choose to put blinders on and try to claim that normalization is invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE
IT.

Don't want to, or don't need to? It's not a matter of ignoring encoding because I don't want to deal with it; it's a matter of ignoring encoding because it's simply not relevant if I treat filenames as strings.
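
And "treating them as strings" isn't an abstract position - it's what the filesystem already does. A quick sketch (assuming bash's $'...' quoting and an HFS+ volume; the scratch path is just something I picked):

	mkdir /tmp/string-demo && cd /tmp/string-demo
	echo hello > $'\xc3\xa4'     # create the file via the precomposed (NFC) spelling of "ä"
	cat $'a\xcc\x88'             # open it via the decomposed (NFD) spelling: same file, prints "hello"
	ls | wc -l                   # still exactly one entry

Either spelling names the same file, which is exactly what you'd expect if filenames are strings rather than octet sequences.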

But that doesn't change the fact that a lot of things *do* see it. There are very few things that are "Unicode specific", and a *lot* of tools that
are just "general data tools".

And git tries to be a general data tool, not a Unicode-specific one.

Yes, I realize that. See my previous message about the ideal vs. the practical.

I'm not sure what you mean. The byte sequence is different from Latin1 to UTF-8 even if you use NFC, so I don't think, in this case, it makes any
difference whether you use NFC or NFD.

Yes, the codepoints are the same in Latin1 and UTF-8 if you use NFC, but
that's hardly relevant.

Please correct me if I'm wrong, but I believe Latin1->UTF-8->Latin1
conversion will always produce the same Latin1 text whether you use NFC or
NFD.

The problem is that the UTF-8 form is different, so if you save things in UTF-8 (which we hopefully agree is a sane thing to do), then you should
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing to something else, you actually de-normalize as far as those other people are
concerned.

So if you have to normalize, at least use the normal form!

Was NFC the common normalization form back in 1998? My understanding is that Unicode adoption was still in progress back then, so there was no single normalization form that was the obvious choice for everyone to use.
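
For reference, the behavior we keep going around about is easy to see from a shell (again assuming bash's $'...' quoting, plus xxd, on an HFS+ volume; the scratch path is mine):

	mkdir /tmp/nfd-demo && cd /tmp/nfd-demo
	touch $'\xc3\xa4'    # ask for the precomposed spelling of "ä": the octets c3 a4
	ls | xxd             # the stored name comes back as 61 cc 88 (plus ls's trailing newline): decomposed

HFS+ hands the name back in decomposed form no matter which spelling it was given.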

The only reason it's particularly inconvenient is because it's different from what most other systems picked. And if you want to blame someone for that,
blame Unicode for having so many different normalization forms.

I blame them for encouraging normalization at all.

It's stupid.

You don't need it.

The people who care about "are these strings equivalent" shouldn't do a "memcmp()" on them in the first place. And if you don't do a memcmp() on
things, then you don't need to normalize.

So you have two cases:
(a) the cases that care about *identity*. They don't want normalization.
(b) the cases that care about *equivalence*. And they shouldn't do
     octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither case is normalization the right thing to do (except as *possibly* an internal part of the comparison, but there are actually better ways to check for
equivalence than the brute-force "normalize both and compare results
bitwise").

I could argue against this, but frankly, I'm really tired of arguing this same point. I suggest we simply agree to disagree, and move on to actually fixing the problem.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



