Re: git on MacOSX and files with decomposed utf-8 file names

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 16 Jan 2008 15:38:35 -0800 (PST)

On Wed, 16 Jan 2008, Kevin Ballard wrote:
> 
> There's a difference between "looks similar" as in "Polish" vs "polish", and
> actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH
> UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization
> doesn't. 

That simply isn't true.

Normalization actually has real semantic meaning. If it didn't, there 
would never ever be a reason why you'd use the non-normalized form in the 
first place.

Others have argued the exact same thing for capitalization. "A" is the 
same letter as "a". Except there is a distinction.

The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's 
the same "character" in either case. Except when there is a distinction.

And there *are* cases where there are distinctions. Especially inside 
computers. For one thing, you may not be talking about "characters on 
screen", but you may be talking about "key sequences". And suddenly 
"a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a 
single-key sequence, and THEY ARE DIFFERENT.

See?

"a" and "A" are the same letter. But sometimes case matters.

Multi-character UTF-8 sequences may be the same character. But sometimes 
the sequence matters.

Same exact thing.
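
To make the distinction concrete, here is a minimal Python sketch. The 
variable names are illustrative only; the point is that the two forms are 
different sequences, and they compare equal only after somebody explicitly 
asks for normalization.

```python
import unicodedata

composed = "\u00e4"      # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS, two code points

# Different code point sequences, so a plain comparison says "different".
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# Only an explicit normalization step makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```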

>	The only way to argue that normalization is wrong is by providing a
> good reason to preserve the exact byte sequence, and so far the only reason
> I've seen is to help git.

Git doesn't care. Just use the *same* sequence everywhere. Make sure 
something doesn't change it. Because if something changes it, git will 
track it.

> How do you figure? When I type "Märchen", I'm typing a string, not a byte
> sequence. I have no control over the normalization of the characters.
> Therefore, depending on what program I'm typing the name in, I might use the
> same normalization as the filename, or I might miss. It's completely out of my
> control. This is why the filesystem has to step in and say "You composed that
> character differently, but I know you were trying to specify this file".

Pure and utter garbage.

What you are describing is an *input method* issue, not a filesystem 
issue.

The fact that you think this has anything whatsoever to do with 
filesystems, I cannot understand.

Here's an example: I can type Märchen two different ways on my keyboard: I 
can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I 
could press the '¨' key and the 'a' key.

See: I get 'ä' and 'ä' respectively.

And as I send this email off, those characters never *ever* got written as 
filenames to any filesystem. But they *did* get written as part of 
text-files to the disk using "write()", yes.

And according to your *insane* logic, that write() call should have 
converted them to the same representation, no?

Hell no! That conversion has absolutely nothing to do with the filesystem. 
It's done at a totally different layer that actually knows what it is 
doing, and turned them both into \xc3\xa4 (and then, the email client 
probably will turn this into Latin1, and send it out as a single-byte 
'\xe4' character).
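
A rough Python sketch of those byte values follows. It assumes the input 
layer composes the dead-key sequence the way NFC normalization would, which 
is only a stand-in for whatever the real input method does; the variable 
names are made up for illustration.

```python
import unicodedata

single_key = "\u00e4"                               # the dedicated key
dead_key = unicodedata.normalize("NFC", "a\u0308")  # assumed: input layer composes

# Both inputs end up as the same text before anything is written out.
print(single_key == dead_key)         # True

# The bytes depend on the encoding chosen above the filesystem.
print(single_key.encode("utf-8"))     # b'\xc3\xa4'
print(single_key.encode("latin-1"))   # b'\xe4'

# Without the composing step, the UTF-8 bytes would differ.
print("a\u0308".encode("utf-8"))      # b'a\xcc\x88'
```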

See? Putting the conversion in the filesystem IS INSANE. You wouldn't make 
the filesystem convert the characters in the data stream (because it would 
cause strange data conversion issues) AND FOR EXACTLY THE SAME REASON it 
shouldn't do it for filenames either!

And your claim that "you have no control over the normalization of 
characters" is simply insane. Of course you have. It's just not supposed 
to be at the filesystem level - whether it's a write() call or a creat() 
call!
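
One way a program could take that control above the filesystem is sketched 
below in Python. The helper names (nfc, find_entry) and the sample directory 
listing are hypothetical; this illustrates the layering, it is not how any 
particular tool does it. The stored name is returned byte-for-byte as it was 
stored, and only the comparison is normalized.

```python
import unicodedata
from typing import List, Optional

def nfc(name: str) -> str:
    """Normalize a name to NFC for comparison purposes only."""
    return unicodedata.normalize("NFC", name)

def find_entry(typed: str, entries: List[str]) -> Optional[str]:
    """Return the stored entry matching `typed` after normalization.

    The entry is returned exactly as stored, so the caller hands the
    original sequence on to open() or creat() unchanged.
    """
    wanted = nfc(typed)
    for entry in entries:
        if nfc(entry) == wanted:
            return entry
    return None

# Hypothetical listing: the stored name uses the decomposed form.
entries = ["Ma\u0308rchen.txt", "README"]

# The user typed the composed form; the application resolves the mismatch.
match = find_entry("M\u00e4rchen.txt", entries)
print(match == "Ma\u0308rchen.txt")   # True
```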

			Linus