Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Thu, 17 Jan 2008 00:23:48 -0500

On Jan 16, 2008, at 11:51 PM, Martin Langhoff wrote:

On Jan 17, 2008 5:30 PM, Kevin Ballard <kevin@xxxxxx> wrote:
Those of us who grew up on a case-insensitive filesystem don't find
there to be any problem with it. I can count on one hand the number of

I guess you haven't used unix tools much. The ever-popular HEAD perl
utility (which does an HTTP HEAD against a URL), when installed,
silently overwrites the head shell utility, which is used for all
sorts of things, some even in startup scripts. Ooops! I've been hit by
this more than once - and if you google for it, it hurt a lot of
people.

I can imagine. However, I've never been hit by such a situation. This  
doesn't mean a case-insensitive filesystem is a problem per se, it  
means interactions between a case-insensitive and a case-sensitive  
filesystem can be a problem. That doesn't mean either way is "correct";
it just means the two don't work well together.

I like ice cream, and I like steak, but I sure don't think a mixture  
of steak and ice cream would go well together. Do you?

That's only true if you don't know what type of filesystem you're on.
And, in the vast majority of cases (in fact, a content tracker is the
only exception I can think of), it doesn't matter. If the user said

Hmmm. Many important tools - that I wouldn't want to ever fail! - have
similar needs to git. Backup/restore and file replication tools for
example.

Both of which would be replicating the directory contents, not a  
listing of files specified by the user. If, as a user, I were to say  
"please replicate file FOO" and the file was really called "foo", I  
wouldn't be in the least surprised to see the tool take me at my word  
and produce a file called "FOO" with the contents of "foo". But in  
general, things like this operate on the filesystem, not on the user  
args.

This is why case-insensitivity is so hard: you have a very real
"aliasing" on the filesystem level, where all those really *different*
pathnames end up being the same thing.

I don't see that as being a problem. Think of it, if you will, as if
every single file simply had an implicit hardlink for every possible
case or normalization variant. The whole point of the filename is that

Ok - but how do you track the directory then (in git's terms, the
tree)? There's no way to tell what the user wants. Does the user want
a copy of the file with different capitalization, or is the OS playing
games?

If I say "track FOO", I probably mean it. So go ahead and track "FOO",  
even if you end up tracking the contents of file "foo". I certainly  
won't blame the tool for doing what I told it.
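
To make the aliasing question concrete, here is a minimal Python sketch that groups the entries of a tree listing by a case-folded key and reports which names would collapse into one file on a case-insensitive filesystem. The function name `aliased_groups` and the sample entries are hypothetical, and `str.casefold()` is only a stand-in: HFS+ uses its own folding table, which this does not reproduce.

```python
from collections import defaultdict


def aliased_groups(names):
    """Return groups of names that differ only by case (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        # casefold() approximates case-insensitive comparison; it is an
        # assumption here, not the exact rule any real filesystem applies.
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Sample tree entries, made up for illustration.
tree_entries = ["FOO", "foo", "bar"]
collisions = aliased_groups(tree_entries)
```

A tool could run a check like this before writing a tree out to disk, so that the user is told about the aliasing instead of having one entry silently replace the other.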

it is meta-information, used as an identifier and not as actual
content, and thus it is perfectly fine for it to be a real string,
subject to interpretation,

I don't think you *actually* want it subject to interpretation.

Sure I do. I find it very convenient, for example, to say "cd  
documents/school" when I really want to go to "Documents/School".  
Similarly, if I'm trying to reference gitweb/tests/Märchen, I'm quite  
happy to not have to figure out what normalization the filename is  
using and attempt to replicate that (especially as I have no idea  
which normalization my input mechanism uses - unlike Linus, I don't  
have a key dedicated to ä, and even if I did I wouldn't necessarily  
expect it to use precomposed vs decomposed). I can't think of a single  
reason why I'd want to be able to have 2 different files named  
"Märchen" on my disk. On the other hand, treating Unicode  
normalization as significant can pose security risks - how am I to  
know that the file that is named "foo.txt" is really the same file  
"foo.txt" that I last saw? Someone I know on IRC sent me this  
image[1], which shows 6 files all apparently named "foo.txt" on a disk  
image. This is possible because on a case-sensitive HFS+ volume, the  
file system doesn't ignore ignorables when comparing filenames (it  
does on a case-insensitive HFS+ system), and so all of those filenames  
look identical up until you actually pipe their names through xxd and  
look at the byte sequence. When this sort of tomfoolery is possible, I  
simply cannot trust the names of any of my files anymore.

[1]: http://sailor月.com/imgs/ignorable.png
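
The two points above (precomposed vs decomposed names, and names that look alike but differ in bytes) can be sketched in Python with the standard `unicodedata` module. The helper `same_name` is hypothetical, and the zero-width joiner U+200D is just one example of an invisible code point; which characters the names in the image actually contained is not known.

```python
import unicodedata

precomposed = "M\u00e4rchen"   # "ä" as one code point (NFC form)
decomposed = "Ma\u0308rchen"   # "a" plus combining diaeresis (NFD form)


def same_name(a, b):
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings render identically but are different byte sequences.
bytes_differ = precomposed.encode("utf-8") != decomposed.encode("utf-8")
equal_after_normalizing = same_name(precomposed, decomposed)

# An invisible format character makes a second "foo.txt" that normalization
# alone does not fold back into the plain name.
plain = "foo.txt"
sneaky = "foo\u200d.txt"
still_distinct = not same_name(plain, sneaky)
```

Normalization handles the Märchen case, but the look-alike "foo.txt" names stay distinct unless the comparison also skips ignorable characters, which is the behavior described above for case-insensitive HFS+.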

Again, as someone who grew up in a case-insensitive world, there are no
problems here. I wish I could tell you that it causes problems, I wish
I could agree with you, but I can't.

Probably because you have been surrounded by tools that have a lot of
extra code to cope with the case insensitive way of life, and learned
to not do things that are completely valid, just to avoid trouble.
Which is ok, but I don't think it makes the OS design decision

Extra code? I don't think so. The only reason I'd need extra code is  
if I were attempting to explicitly detect the "real" filename for a  
user-supplied argument, by scanning the directory contents until I  
found a file that was equivalent to the given argument. But there's no  
reason to do that. None of the code I've ever written, or any of the  
code I've ever seen, has had to do any extra work because it was on a  
case-insensitive filesystem. I contribute to a packaging system for  
the Mac called MacPorts, and I've never seen any patches on any of the  
4000+ ports to handle case insensitivity (granted, I haven't looked at  
every port, but I've looked at a significant fraction). It's a  
complete non-issue.
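
For reference, the "extra code" case described above, scanning a directory to recover the stored name for a user-supplied argument, could look roughly like the following Python sketch. The names `name_key` and `find_real_name` are hypothetical, and NFC plus `casefold()` is an assumed equivalence rule, not the one HFS+ itself applies.

```python
import os
import tempfile
import unicodedata


def name_key(name):
    """Fold a name to a comparison key (assumed rule: NFC, then casefold)."""
    return unicodedata.normalize("NFC", name).casefold()


def find_real_name(directory, requested):
    """Return the stored name equivalent to `requested`, or None if absent."""
    wanted = name_key(requested)
    for entry in os.listdir(directory):
        if name_key(entry) == wanted:
            return entry
    return None


# Usage: a scratch directory holding one file called "foo".
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "foo"), "w").close()
    found = find_real_name(scratch, "FOO")
    missing = find_real_name(scratch, "bar")
```

Only a tool that must record the on-disk spelling, rather than the spelling the user typed, would need a lookup like this.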

The content of files is sacred. The filename is only there to provide  
a handle to locate the contents. I don't see any problem with  
expanding the equivalency scope of the filename to accept multiple  
encodings and cases. The only arguments I can see that have any  
validity at all are the ones that sound like "we use case-sensitive  
filesystems, and your case-insensitivity and normalization are causing  
problems with our tools! Conform to our world!". As I said above, this  
isn't a problem of case-insensitivity or normalization, it's a problem  
of interaction between two incompatible viewpoints. All I want to do  
is make git play nicer in an HFS+ world, and this would be far easier  
if you guys were willing to admit this is a problem that should be  
solved in the tool rather than a problem with the system.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
