Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 3:33 PM, Linus Torvalds wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:

It's what shows up when you sha1sum, but it's also as simple as what shows
up when you do an "ls -l" and look at a file size.

It does? Why on earth should it do that? The filename doesn't contribute to
the listed filesize on OS X.

Umm. What's this inability to see that data is data is data?

I'm not sure what you mean. I stated a fact - at least on OS X, the filename does not contribute to the listed filesize, so changing the encoding of the filename doesn't change the filesize. This isn't a philosophical point, it's a factual statement.
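
To make that concrete (throwaway filenames of my own choosing; nothing OS X-specific about this):

	printf 'hello\n' > short
	printf 'hello\n' > a-considerably-longer-filename
	ls -l short a-considerably-longer-filename   # both report 6 bytes; the name's length never enters into it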

Why do you think Unicode has anything in particular to do with filenames?

I don't, but I do think this discussion revolves around filenames, therefore it should not surprise you when I talk about filenames.

Those same unicode strings are often part of the file data itself, and
then that encoding damn well is visible in "ls -l".

Doing

	echo ä > file
	ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much. My point was that those underlying octets are always there, and they do matter. The fact that the differences may not be visible when you compare
the normalized forms doesn't make it any less true.

Yes, I am well aware that the encoding of the *file contents* affects filesize. But when did I suggest changing the encoding of filenames that happen to appear inside file contents? If you treat filenames as strings, there's no requirement to touch those occurrences at all. I'm talking specifically about the filenames themselves, not about file contents, so stop arguing against something irrelevant.

You can choose to put blinders on and try to claim that normalization is invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE
IT.

Don't want to, or don't need to? It's not a matter of ignoring encoding because I don't want to deal with it; it's a matter of ignoring encoding because it's simply not relevant if I treat filenames as strings.
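
And "treating them as strings" isn't an abstract position - it's what the filesystem already does. A quick sketch (assuming bash's $'...' quoting and an HFS+ volume; the scratch path is just something I picked):

	mkdir /tmp/string-demo && cd /tmp/string-demo
	echo hello > $'\xc3\xa4'     # create the file via the precomposed (NFC) spelling of "ä"
	cat $'a\xcc\x88'             # open it via the decomposed (NFD) spelling: same file, prints "hello"
	ls | wc -l                   # still exactly one entry

Either spelling names the same file, which is exactly what you'd expect if filenames are strings rather than octet sequences.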

But that doesn't change the fact that a lot of things *do* see it. There are very few things that are "Unicode specific", and a *lot* of tools that
are just "general data tools".

And git tries to be a general data tool, not a Unicode-specific one.

Yes, I realize that. See my previous message about the ideal vs. the practical.

I'm not sure what you mean. The byte sequence is different from Latin1 to UTF-8 even if you use NFC, so I don't think, in this case, it makes any
difference whether you use NFC or NFD.

Yes, the codepoints are the same in Latin1 and UTF-8 if you use NFC, but
that's hardly relevant.

Please correct me if I'm wrong, but I believe Latin1->UTF-8->Latin1
conversion will always produce the same Latin1 text whether you use NFC or
NFD.

The problem is that the UTF-8 form is different, so if you save things in UTF-8 (which we hopefully agree is a sane thing to do), then you should
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing to something else, you actually de-normalize as far as those other people are
concerned.

So if you have to normalize, at least use the normal form!

Was NFC the common normalization form back in 1998? My understanding is that Unicode adoption was still in progress back then, so there was no single normalization form that was the obvious choice for everyone to use.
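
For reference, the behavior we keep going around about is easy to see from a shell (again assuming bash's $'...' quoting, plus xxd, on an HFS+ volume; the scratch path is mine):

	mkdir /tmp/nfd-demo && cd /tmp/nfd-demo
	touch $'\xc3\xa4'    # ask for the precomposed spelling of "ä": the octets c3 a4
	ls | xxd             # the stored name comes back as 61 cc 88 (plus ls's trailing newline): decomposed

HFS+ hands the name back in decomposed form no matter which spelling it was given.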

The only reason it's particularly inconvenient is because it's different from what most other systems picked. And if you want to blame someone for that,
blame Unicode for having so many different normalization forms.

I blame them for encouraging normalization at all.

It's stupid.

You don't need it.

The people who care about "are these strings equivalent" shouldn't do a "memcmp()" on them in the first place. And if you don't do a memcmp() on
things, then you don't need to normalize.

So you have two cases:
(a) the cases that care about *identity*. They don't want normalization.
(b) the cases that care about *equivalence*. And they shouldn't do
     octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither case is normalization the right thing to do (except as *possibly* an internal part of the comparison, but there are actually better ways to check for
equivalence than the brute-force "normalize both and compare results
bitwise").

I could argue against this, but frankly, I'm really tired of arguing this same point. I suggest we simply agree to disagree, and move on to actually fixing the problem.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



