Re: git on MacOSX and files with decomposed utf-8 file names


 



On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:

On Wed, 16 Jan 2008, Kevin Ballard wrote:

It's not using different encodings, it's all Unicode. However, it accepts different normalization variants of Unicode, since it can read them all and it would be folly to require everybody to conform to its own special internal variant. But it does have to normalize them, otherwise how would it detect the
same filename using different normalizations?

That's a singularly *stupid* argument.

Here, let me rephrase that same idiotic argument:

"But it does have to uppercase them, otherwise how would it detect the
  same filename using different cases?"

..and if you don't see how that's *exactly* the same argument, you really
are stupid.

You're right, it doesn't actually have to store the normalized form. And yes, it's possible to compare without normalizing them. Admittedly, I don't know much about the implementation details of unicode, but I would assume that the easiest way to compare two strings is to normalize them first. But in the case of the filesystem, normalization actually is important if you're thinking about filenames in terms of characters rather than bytes. When I feed the filesystem a given unicode string, it has to find the file I'm talking about - should it do a relatively expensive unicode-sensitive comparison of all the filenames with the one I gave it, or should it just normalize all names and do the much cheaper lookup that way? I don't know about you, but I'd prefer to let my filesystem normalize the name and run faster.
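To make the trade-off between the two lookup strategies concrete, here is a minimal Python sketch. It is not how HFS+ is implemented; `NormalizedDirectory`, `key`, and `scan_lookup` are hypothetical names, and NFD from Python's `unicodedata` module is only an assumed stand-in for whatever canonical form a filesystem might choose.

```python
import unicodedata


def key(name: str) -> str:
    # Assumption: NFD stands in for the filesystem's chosen canonical form.
    return unicodedata.normalize("NFD", name)


class NormalizedDirectory:
    """Hypothetical directory that normalizes once, at creation time."""

    def __init__(self):
        self._entries = {}  # normalized name -> contents

    def create(self, name: str, contents: str) -> None:
        self._entries[key(name)] = contents

    def lookup(self, name: str):
        # One normalization of the query, then a plain hash lookup.
        return self._entries.get(key(name))


def scan_lookup(names, wanted):
    # The alternative: keep names as given and do a
    # normalization-insensitive comparison against every entry.
    target = key(wanted)
    return [n for n in names if key(n) == target]


composed = "M\u00e4rchen"     # a-with-diaeresis as a single code point
decomposed = "Ma\u0308rchen"  # 'a' followed by a combining diaeresis

d = NormalizedDirectory()
d.create(composed, "once upon a time")
found = d.lookup(decomposed)

matches = scan_lookup([composed, "polish", "Polish"], decomposed)
```

In this sketch `lookup` normalizes only the query string, while `scan_lookup` has to normalize every stored name on each call, which is the cost difference the paragraph above is pointing at.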

The fact is, normalization is wrong.

It's wrong when you normalize upper/lower case (no, the word "Polish" is not the same as "polish"), and it's equally wrong when you normalize for
"looks similar".

There's a difference between "looks similar" as in "Polish" vs "polish", and actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization doesn't. The only way to argue that normalization is wrong is by providing a good reason to preserve the exact byte sequence, and so far the only reason I've seen is to help git. Applications in general don't care one whit about the byte sequence of the filename, they care about the underlying file the name represents. Additionally, it would be a terrible experience for a user to enter "Märchen" and have the application say "sorry, I can't find this file" simply because the application used decomposed characters and the filename used composed characters. Unless the user is knowledgeable about the OS, filesystems, and unicode, they wouldn't have a hope of figuring out what the problem was.
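For readers who want to see the two spellings side by side, the following Python sketch uses the standard `unicodedata` module. The variable names are illustrative only. It shows that the composed and decomposed forms are different code point and byte sequences, that either Unicode normalization form maps one onto the other, and that normalization does not fold case.

```python
import unicodedata

composed = "M\u00e4rchen"     # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "Ma\u0308rchen"  # 'a' followed by U+0308 COMBINING DIAERESIS

# As raw code points (and therefore as UTF-8 bytes) the two differ.
same_code_points = composed == decomposed
composed_bytes = composed.encode("utf-8")      # 8 bytes
decomposed_bytes = decomposed.encode("utf-8")  # 9 bytes

# After normalizing to either form, they compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Normalization leaves case alone: "Polish" and "polish" stay distinct.
case_folded_by_nfc = (
    unicodedata.normalize("NFC", "Polish")
    == unicodedata.normalize("NFC", "polish")
)
```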


In other words, you're used to filenames as bytes, HFS+ treats filenames
as strings.

No. HFS+ treats users as idiots and thinks that it should "fix" the
filename for them. And it causes problems.

How do you figure? When I type "Märchen", I'm typing a string, not a byte sequence. I have no control over the normalization of the characters. Therefore, depending on what program I'm typing the name in, I might use the same normalization as the filename, or I might miss. It's completely out of my control. This is why the filesystem has to step in and say "You composed that character differently, but I know you were trying to specify this file".

It causes problems for exactly the same reasons case-independence causes problems, because it's EXACTLY THE SAME ISSUE. People may think that "but
they are the same", but they aren't. Case matters. And so does "single
character" vs "two character overlay".

There are valid reasons for case to matter, but what reason is there for "single character" vs "two character overlay" to matter in filenames? They're different representations of the exact same string, and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user cares about the byte sequence that represents the filename, which is wrong. The user has no idea what the byte sequence is - the user cares about the string. Normalization is meant to help computers, not users, and claiming that different normalizations of the same string produce different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them called "Märchen", but one precomposed and one decomposed, how would you specify which one you wanted? Unless Linux has a special text input system which gives the user control over the normalization of their typed characters, you'd have to write out the UTF-8 bytes manually.
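The scenario can be simulated in a few lines of Python. This is only a sketch: `directory` is a hypothetical in-memory stand-in for a byte-preserving filesystem, and the `input_form` parameter models whichever normalization the text input system happens to emit, which is an assumption and not something the user normally gets to choose.

```python
import unicodedata

# Hypothetical byte-preserving directory: names are raw UTF-8 bytes,
# so both spellings can coexist as separate entries.
directory = {
    b"M\xc3\xa4rchen": "precomposed file",
    b"Ma\xcc\x88rchen": "decomposed file",
}


def open_as_typed(typed: str, input_form: str):
    # input_form is the form the input method emits (assumed "NFC" or "NFD").
    raw = unicodedata.normalize(input_form, typed).encode("utf-8")
    return directory.get(raw)


typed = "M\u00e4rchen"  # what the user sees on screen either way

via_nfc_input = open_as_typed(typed, "NFC")
via_nfd_input = open_as_typed(typed, "NFD")

# Picking a specific entry on purpose means spelling out the bytes.
explicit = directory[b"Ma\xcc\x88rchen"]
```

The same visible string selects a different file depending on the input form, and the only way to choose deliberately is the byte-escaped key on the last line.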

I just don't understand this insistence on treating the specific byte sequence that makes up the filename as significant.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



