Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 2:41 PM, Linus Torvalds wrote:

> On Mon, 21 Jan 2008, Kevin Ballard wrote:

>> I'm not saying it's forced on you, I'm saying that when you treat
>> filenames as text, it DOESN'T MATTER if the string gets normalized.
>> As long as the string remains equivalent, YOU DON'T CARE about the
>> underlying byte stream.

> Sure I do, because it matters a lot for things like - wait for it -
> things like checksumming it.

I believe I already responded to the issue of hashing. In summary: redefine your hash function to convert the string to one specific normalization form before hashing. Sure, you'll lose some speed, but we're already assuming it's worth taking a speed hit in order to treat filenames as strings. (Please don't argue that point; it's an opinion, not a factual statement, and I'm not necessarily saying I agree with it, only that it's a valid position.)
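To make that concrete, here is a minimal sketch (Python assumed, with its unicodedata and hashlib modules; this illustrates the idea, it is not anything git actually does):

import hashlib
import unicodedata

def filename_hash(name: str) -> str:
    """Hash a filename by its canonical form, not its raw bytes.

    Canonically equivalent spellings (NFC vs. NFD) hash identically,
    at the cost of one normalization pass per filename.
    """
    canonical = unicodedata.normalize("NFC", name)  # any fixed form works
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Precomposed U+00E4 and decomposed U+0061 U+0308 hash the same:
assert filename_hash("\u00e4") == filename_hash("a\u0308")

Which form you pick doesn't matter, as long as every caller picks the same one.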

>> Alright, fine. I'm not saying HFS+ is right in storing the normalized
>> version, but I do believe the authors of HFS+ must have had a reason
>> to do that, and I also believe it shouldn't make any difference to
>> me, since the string remains equivalent.

> I've already told you the reason: they made the mistake of wanting to
> be case-independent, and a (bad) case compare is easier in NFD.

> Once you give strings semantic meaning (and "case independent" implies
> that semantic meaning), suddenly normalization looks like a good idea,
> and since you're going to corrupt the data *anyway*, who cares? You
> just created a file like "Hello", and readdir() returns "hello"
> (because there was an old file under that name), and that's a lot more
> obviously corrupt than normalization alone.

Perhaps that is the reason; I don't know (and neither do you, you're just guessing). However, my point still stands: as long as the string stays canonically equivalent, it doesn't matter to me if the filesystem changes the encoding, because I'm working at the string level.
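And "canonically equivalent" is a precise Unicode notion, not hand-waving. A quick illustration (again Python's unicodedata assumed):

import unicodedata

nfc = "\u00e4"      # precomposed: LATIN SMALL LETTER A WITH DIAERESIS
nfd = "a\u0308"     # "a" + COMBINING DIAERESIS, what HFS+ stores

print(nfc == nfd)                                  # False: bytes differ
print(nfc.encode("utf-8"), nfd.encode("utf-8"))    # b'\xc3\xa4' vs b'a\xcc\x88'
print(unicodedata.normalize("NFC", nfd) == nfc)    # True: canonically equivalent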

>> Sure it does. Normalizing a string produces an equivalent string, so
>> unless I look at the octets, the two strings are, for all intents and
>> purposes, the same.

> .. but you *have* to look at the octets at some point. They're kind of
> what the string is built up of. They never went away, even if you
> chose to ignore them. The encoding really is quite important, and is
> visible both in memory and on disk.

Someone has to look at the octets, but it doesn't have to be me. As long as I use Unicode-aware libraries, I can let the underlying system worry about the byte-level representation, and my code stays clean.

> It's what shows up when you sha1sum, but it's also as simple as what
> shows up when you do an "ls -l" and look at a file size.

It does? Why on earth should it do that? The filename doesn't contribute to the listed file size on OS X:

kevin@KBLAPTOP:~> echo foo > foo; echo foo > foobar
kevin@KBLAPTOP:~> ls -l foo*
-rw-r--r--  1 kevin  kevin  4 Jan 21 14:50 foo
-rw-r--r--  1 kevin  kevin  4 Jan 21 14:50 foobar

It would be singularly stupid for the file size to reflect the filename, especially since hard links to the same file could then report different sizes.

> It doesn't matter if the text is "equivalent" when you then see the
> differences in all these small details.
>
> You can shut your eyes as much as you want and say that you don't
> care, but the differences are real, and they are visible.

Visible at some level, sure, but not visible at the level my code works on. And thus I don't have to care about them.

>> Decomposing and recomposing shouldn't lose any information we care
>> about. When treating filenames as text, a<COMBINING DIAERESIS> and
>> <A WITH DIAERESIS> are equivalent, and thus no distinction is made
>> between them. I'm not sure what other information you might consider
>> lost in this case.

> You're right, I messed up. I used a non-combining diaeresis, and
> you're right, it doesn't get corrupted. And I think that means that if
> Apple had used NFC, we'd not have this problem with Latin1 systems
> (because then the UTF-8 representation would be the same).

I'm not sure what you mean. The byte sequence differs between Latin1 and UTF-8 even if you use NFC, so I don't think it makes any difference here whether you use NFC or NFD. Yes, with NFC the code points are the same as in Latin1, but that's hardly relevant. Please correct me if I'm wrong, but I believe a Latin1->UTF-8->Latin1 round trip will always produce the same Latin1 text whether the UTF-8 side uses NFC or NFD.
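Here's what I mean, as a sketch (Python assumed; the one caveat is that the converter has to recompose to NFC before encoding back, since a bare combining mark has no Latin1 code point):

import unicodedata

latin1_bytes = "Z\u00fcrich".encode("latin-1")      # b'Z\xfcrich'
text = latin1_bytes.decode("latin-1")

# Store as UTF-8 in either normalization form:
utf8_nfc = unicodedata.normalize("NFC", text).encode("utf-8")
utf8_nfd = unicodedata.normalize("NFD", text).encode("utf-8")
print(utf8_nfc)   # b'Z\xc3\xbcrich'   - different bytes...
print(utf8_nfd)   # b'Zu\xcc\x88rich'  - ...either way

# Converting back, recompose first, then encode: both forms round-trip
# to the identical Latin1 bytes.
back_nfc = unicodedata.normalize("NFC", utf8_nfc.decode("utf-8")).encode("latin-1")
back_nfd = unicodedata.normalize("NFC", utf8_nfd.decode("utf-8")).encode("latin-1")
assert back_nfc == back_nfd == latin1_bytes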

> So I still think that normalization is totally idiotic, but the thing
> that actually causes most problems for people on OS X is that they
> chose the really inconvenient one.

The only reason it's particularly inconvenient is that it differs from what most other systems picked. And if you want to blame someone for that, blame Unicode for having so many different normalization forms.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



