Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Wed, 16 Jan 2008 18:11:31 -0500

On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:

On Wed, 16 Jan 2008, Kevin Ballard wrote:

It's not using different encodings, it's all Unicode. However, it accepts
different normalization variants of Unicode, since it can read them all
and it would be folly to require everybody to conform to its own special
internal variant. But it does have to normalize them, otherwise how would
it detect the same filename using different normalizations?

That's a singularly *stupid* argument.

Here, let me rephrase that same idiotic argument:

 "But it does have to uppercase them, otherwise how would it detect the
  same filename using different cases?"

..and if you don't see how that's *exactly* the same argument, you really
are stupid.

You're right, it doesn't actually have to store the normalized form.  
And yes, it's possible to compare without normalizing them.  
Admittedly, I don't know much about the implementation details of
Unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given Unicode string, it has to find the file I'm talking about -
should it do a relatively expensive Unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize  
all names and do the much cheaper lookup that way? I don't know about  
you, but I'd prefer to let my filesystem normalize the name and run  
faster.
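
As a rough illustration of that trade-off, here is a minimal Python sketch. It is not how HFS+ is implemented: the `Directory` class and `key` helper are hypothetical, and standard NFD stands in for whatever decomposition rules the filesystem really applies. The point is only that normalizing the query once turns the search into a plain keyed lookup, and that the name as originally given can still be stored alongside.

```python
import unicodedata


def key(name):
    # Hypothetical lookup key: the NFD form of the name (a stand-in for the
    # filesystem's real rules, which are not reproduced here).
    return unicodedata.normalize("NFD", name)


class Directory:
    """Toy in-memory directory, for illustration only."""

    def __init__(self):
        self._entries = {}  # normalized key -> (name as given, data)

    def create(self, name, data):
        # The name is kept exactly as given; only the index key is normalized.
        self._entries[key(name)] = (name, data)

    def lookup(self, name):
        # Normalize the query once, then do an ordinary dictionary lookup
        # instead of comparing against every stored name.
        entry = self._entries.get(key(name))
        return None if entry is None else entry[1]


d = Directory()
d.create("M\u00e4rchen", b"once upon a time")   # created with precomposed U+00E4
found = d.lookup("Ma\u0308rchen")               # looked up with "a" + U+0308
```

Here `found` holds the file's data even though the two spellings differ at the code point level.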

The fact is, normalization is wrong.

It's wrong when you normalize upper/lower case (no, the word "Polish" is
not the same as "polish"), and it's equally wrong when you normalize for
"looks similar".

There's a difference between "looks similar", as in "Polish" vs
"polish", and "actually is the same string", as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning, normalization doesn't. The only way to argue that  
normalization is wrong is by providing a good reason to preserve the  
exact byte sequence, and so far the only reason I've seen is to help  
git. Applications in general don't care one whit about the byte  
sequence of the filename, they care about the underlying file the name  
represents. Additionally, it would be a terrible experience for a user  
to enter "Märchen" and have the application say "sorry, I can't find  
this file" simply because the application used decomposed characters  
and the filename used composed characters. Unless the user is  
knowledgeable about the OS, filesystems, and Unicode, they wouldn't
have a hope of figuring out what the problem was.
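
To make the "Märchen" case concrete, a small Python check using the standard `unicodedata` module (the variable names are just for illustration): the two spellings are different code point sequences, yet they normalize to the same string in both NFC and NFD.

```python
import unicodedata

# Two spellings of the same word: precomposed U+00E4 versus
# "a" followed by the combining diaeresis U+0308.
composed = "M\u00e4rchen"
decomposed = "Ma\u0308rchen"

# Compared code point by code point, they differ...
same_code_points = composed == decomposed

# ...but they are canonically equivalent: each normalization form maps
# both spellings to one and the same string.
same_nfc = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
same_nfd = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)

print(same_code_points, same_nfc, same_nfd)  # False True True
```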

In other words, you're used to filenames as bytes, HFS+ treats filenames
as strings.

No. HFS+ treats users as idiots and thinks that it should "fix" the
filename for them. And it causes problems.

How do you figure? When I type "Märchen", I'm typing a string, not a  
byte sequence. I have no control over the normalization of the  
characters. Therefore, depending on what program I'm typing the name  
in, I might use the same normalization as the filename, or I might  
miss. It's completely out of my control. This is why the filesystem  
has to step in and say "You composed that character differently, but I  
know you were trying to specify this file".

It causes problems for exactly the same reasons case-independence causes
problems, because it's EXACTLY THE SAME ISSUE. People may think that "but
they are the same", but they aren't. Case matters. And so does "single
character" vs "two character overlay".

There are valid reasons for case to matter, but what reason is there
for "single character" vs "two character overlay" to matter in
filenames? They're different representations of the exact same string,  
and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user  
cares about the byte sequence that represents the filename, which is  
wrong. The user has no idea what the byte sequence is - the user cares  
about the string. Normalization is meant to help computers, not users,  
and claiming that different normalizations of the same string produce
different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them  
called "Märchen", but one precomposed and one decomposed, how would  
you specify which one you wanted? Unless Linux has a special text  
input system which gives the user control over the normalization of  
their typed characters, you'd have to write out the UTF-8 bytes  
manually.
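
For reference, this is roughly what "writing out the UTF-8 bytes" would involve, sketched in Python. Both names render identically on screen; only the encoded bytes tell them apart.

```python
composed = "M\u00e4rchen"      # precomposed: U+00E4
decomposed = "Ma\u0308rchen"   # decomposed: "a" + U+0308

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)    # b'M\xc3\xa4rchen'
print(decomposed_bytes)  # b'Ma\xcc\x88rchen'
```

A user who wanted one specific file of the two would have to supply exactly one of these byte sequences.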

I just don't understand this insistence on treating the specific byte  
sequence that makes up the filename as significant.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
