Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 16, 2008, at 6:38 PM, Linus Torvalds wrote:

On Wed, 16 Jan 2008, Kevin Ballard wrote:

There's a difference between "looks similar" as in "Polish" vs "polish", and actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization doesn't.

That simply isn't true.

Normalization actually has real semantic meaning. If it didn't, there
would never ever be a reason why you'd use the non-normalized form in the
first place.

My understanding is that normalization is there to help the computer. That doesn't give it any semantic meaning, because all normal forms of a given string still represent the exact same string to the user.
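
To make the two forms under discussion concrete, here is a minimal Python sketch using the standard unicodedata module. The helper name same_string is hypothetical and does not come from this thread. It only shows that the precomposed and decomposed spellings differ as code point sequences, yet compare equal once both are brought to the same normal form.

```python
import unicodedata

# Two spellings of the same word (escapes used so the form is unambiguous).
composed = "M\u00e4rchen"     # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "Ma\u0308rchen"  # 'a' followed by U+0308 COMBINING DIAERESIS


def same_string(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed               # plain comparison of code points
normalized_equal = same_string(composed, decomposed)
```

Here raw_equal is False and normalized_equal is True, which is the distinction both sides of the argument are describing from different angles.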

Others have argued the exact same thing for capitalization. "A" is the
same letter as "a". Except there is a distinction.

The argument for case insensitivity is different than the argument for normalization. I certainly hope you understand why they are different arguments, or there's really no point in going further.

The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's
the same "character" in either case. Except when there is a distinction.

And there *are* cases where there are distinctions. Especially inside
computers. For one thing, you may not be talking about "characters on
screen", but you may be talking about "key sequences". And suddenly
"a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a
single-key sequence, and THEY ARE DIFFERENT.

See?

"a" and "A" are the same letter. But sometimes case matters.

Multi-character UTF-8 sequences may be the same character. But sometimes
the sequence matters.

Same exact thing.

You're right, sometimes the sequence matters. As in key sequences. But we're not talking about key sequences, we're talking about strings. Just because it matters sometimes doesn't mean it matters all the time.


The only way to argue that normalization is wrong is by providing a
good reason to preserve the exact byte sequence, and so far the only reason
I've seen is to help git.

Git doesn't care. Just use the *same* sequence everywhere. Make sure
something doesn't change it. Because if something changes it, git will
track it.

And how am I supposed to use the same sequence everywhere? When I type "Märchen", I don't know which form I'm typing, nor should I. It's not something that I, as a user, should have to know. Especially if I pass this name through various other utilities before using it - I have no idea if another utility is going to end up normalizing the name, and it shouldn't matter, as they are equivalent strings.

How do you figure? When I type "Märchen", I'm typing a string, not a byte
sequence. I have no control over the normalization of the characters.
Therefore, depending on what program I'm typing the name in, I might use the same normalization as the filename, or I might miss. It's completely out of my control. This is why the filesystem has to step in and say "You composed that character differently, but I know you were trying to specify this file".
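
A minimal sketch of the lookup behaviour described above, assuming a directory is nothing more than a list of names held in memory. The function lookup and the choice of NFD as the comparison form are illustrative assumptions. This does not reproduce how HFS+ or any real filesystem is implemented.

```python
import unicodedata


def lookup(requested, entries):
    """Return the stored name matching `requested` up to normalization, else None.

    Hypothetical helper: both sides are normalized to NFD before comparing,
    and the name is returned exactly as it was stored.
    """
    key = unicodedata.normalize("NFD", requested)
    for stored in entries:
        if unicodedata.normalize("NFD", stored) == key:
            return stored
    return None


# The directory holds the decomposed spelling; the user types the precomposed one.
entries = ["Ma\u0308rchen", "README"]
typed = "M\u00e4rchen"

exact_match = typed in entries   # byte-for-byte style comparison
found = lookup(typed, entries)   # normalization-insensitive comparison
```

With an exact comparison the typed name misses the stored one. With the normalizing comparison it resolves to the stored entry, whichever form the input method produced.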

Pure and utter garbage.

What you are describing is an *input method* issue, not a filesystem
issue.

The fact that you think this has anything what-so-ever to do with
filesystems, I cannot understand.

Here's an example: I can type Märchen two different ways on my keyboard: I can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I
could press the '¨' key and the 'a' key.

See: I get 'ä' and 'ä' respectively.

On a US keyboard I only have one way of typing ä, and I have no idea whether it ends up precomposed or decomposed in the resulting byte stream. And I don't care. Because I'm typing characters, not bytes. I could be typing in a file in ISO-Latin-1 and I still wouldn't care, because it looks the same to me. If my filesystem did make a distinction between the normal forms, and I see that I have a file named "Märchen", how am I supposed to type that at my keyboard? I don't know which normal form it's using.

The fact that you think the normalization of the string matters, I don't understand.

And as I send this email off, those characters never *ever* got written as
filenames to any filesystem. But they *did* get written as part of
text-files to the disk using "write()", yes.

And according to your *insane* logic, that write() call should have
converted them to the same representation, no?


Hell no! That conversion has absolutely nothing to do with the filesystem.
It's done at a totally different layer that actually knows what it is
doing, and turned them both into \xc3\xa4 (and then, the email client
probably will turn this into Latin1, and send it out as a single-byte
'\xe4' character).
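
The byte values mentioned above can be checked with a short Python sketch. It assumes the two typed forms arrive as a precomposed U+00E4 and as 'a' plus U+0308. Which form a given input method really produces is outside what this example can show.

```python
import unicodedata

precomposed = "\u00e4"   # single code point
decomposed = "a\u0308"   # base letter plus combining diaeresis

# UTF-8 encodings of the two forms differ.
utf8_precomposed = precomposed.encode("utf-8")
utf8_decomposed = decomposed.encode("utf-8")

# A layer above the filesystem can compose first, then encode.
composed = unicodedata.normalize("NFC", decomposed)
utf8_after_nfc = composed.encode("utf-8")
latin1_after_nfc = composed.encode("latin-1")

# The decomposed form cannot go to Latin-1 directly: U+0308 has no Latin-1 byte.
try:
    decomposed.encode("latin-1")
    decomposed_fits_latin1 = True
except UnicodeEncodeError:
    decomposed_fits_latin1 = False
```

After composition both inputs give the two bytes \xc3\xa4 in UTF-8 and the single byte \xe4 in Latin-1, matching the values quoted in the message.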

See? Putting the conversion in the filesystem IS INSANE. You wouldn't make the filesystem convert the characters in the data stream (because it would cause strange data conversion issues) AND FOR EXACTLY THE SAME REASON it
shouldn't do it for filenames either!

What a fabulous straw man argument you just put together. I hope you don't need me to point out why this argument is fundamentally flawed.

And your claim that "you have no control over the normalization of
characters" is simply insane. Of course you have. It's just not supposed to be at the filesystem level - whether it's a write() call or a creat()
call!

I'm speaking as a user, and as such, I shouldn't even have to know that it's possible to write the same character in multiple different ways. As a user, HFS+ behaves exactly the way I want it to. You were talking earlier about not messing with the "user data", but what is the "user data"? It's the string, not the byte sequence. That's all I care about - the string. That's all the OS cares about, that's all any application I use cares about, and that's all git should care about.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



