Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Wed, 16 Jan 2008 23:30:01 -0500

On Jan 16, 2008, at 11:08 PM, Linus Torvalds wrote:

On Wed, 16 Jan 2008, Kevin Ballard wrote:

I believe it exists because HFS+ was created at a time when the Mac was
moving from a multi-encoding world (which was a nightmare) to a Unicode
world and they wanted to remove ambiguity in filenames. But I wasn't
around when they made this decision so this is just a guess.

I do agree. And I think starting out case-insensitive (something they
must really hate by now) also made it less of an issue. When you're
case-insensitive, the issues with any UTF-8 normalization are simply
swamped by all the issues of case, so you probably don't even think
about it very much.

Those of us who grew up on a case-insensitive filesystem don't find  
there to be any problem with it. I can count on one hand the number of  
times I've run into a problem caused by a case-insensitive filesystem.  
That number is 1. And that 1 time is when git screwed up trying to  
track CS4536 and cs4536 in the same directory (see earlier thread).
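
The CS4536/cs4536 clash can be detected ahead of time. Below is a minimal sketch, assuming Python's `str.casefold()` is a good enough stand-in for the filesystem's own case-folding rules (it is not identical to what HFS+ does). The function name `case_collisions` is hypothetical and is not part of git.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group tracked paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() is an approximation of filesystem case folding.
        groups[path.casefold()].append(path)
    # Keep only the keys that more than one distinct spelling maps to.
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    tracked = ["README", "CS4536/notes.txt", "cs4536/notes.txt", "Makefile"]
    for group in case_collisions(tracked):
        print("would collide on a case-insensitive filesystem:", group)
```

Any group it reports is a set of names that a case-sensitive index can hold but a case-insensitive working directory cannot.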

The big problem with any name rewriting is that I can open file 'xyz',
and I literally have a very hard time knowing whether that file I know I
opened and created has anything to do with the file 'Xyz' that I see
when I do a readdir().

That's only true if you don't know what type of filesystem you're on.  
And, in the vast majority of cases (in fact, a content tracker is the  
only exception I can think of), it doesn't matter. If the user said  
'xyz' and you can stat() it, great, that's what the user wanted! Just  
because it's really called 'Xyz' on the filesystem doesn't make any  
difference.
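
Knowing "what type of filesystem you're on" can be approximated with a probe. This is a rough sketch, not how git or any particular tool does it: it creates a temporary file and checks whether the same name with its case swapped also resolves. The function name `is_case_insensitive` and the `CaseProbe` prefix are made up for illustration.

```python
import os
import tempfile


def is_case_insensitive(directory):
    """Return True if a name in `directory` also resolves with its case swapped."""
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        dirname, name = os.path.split(probe.name)
        swapped = os.path.join(dirname, name.swapcase())
        # On a case-sensitive filesystem the swapped name should not exist
        # (barring an unrelated file that happens to have that exact name).
        return os.path.exists(swapped)


if __name__ == "__main__":
    print(is_case_insensitive(tempfile.gettempdir()))
```

The answer applies only to the directory probed; a different mount point can behave differently.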

Are they the same? Maybe. But it's literally hard to tell on OS X. I can
do an fstat() on my file descriptor and on the directory entry, and if
they get the same d_ino they *probably* are the same entry, but even
then it actually could have been a hardlink (and my 'xyz' is really
*another* name for it entirely, and the filesystem is actually
case-sensitive and 'Xyz' was a *different* name that somebody else
did!).
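
The inode comparison described above can be sketched as follows, assuming a POSIX system where `st_dev` and `st_ino` are meaningful (on Windows, `os.DirEntry.stat()` does not fill them in). The helper name `names_for_open_file` is hypothetical.

```python
import os


def names_for_open_file(fd, directory):
    """List the entries in `directory` that share device and inode with `fd`."""
    target = os.fstat(fd)
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            if (info.st_dev, info.st_ino) == (target.st_dev, target.st_ino):
                matches.append(entry.name)
    return sorted(matches)
```

A single match is probably the on-disk spelling of the name that was opened. More than one match is exactly the hardlink ambiguity: the inode alone cannot say which of those names the caller meant.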

See? If you're creating a content tracker, these kinds of issues are not
"idle chatter". It's really *really* important. Was that file the one I
was told to track? Or was it a temporary file that was just hardlinked?

But git is a content tracker, so even if it's really a different
hardlink, that shouldn't matter: it's still referencing the same
content. Go ahead and track whatever name the user specified
originally; as long as it maps to a file on disk with the expected
content, you're set. If the file is really called 'foo' and I told git
to track 'Foo', I'm perfectly happy with it continuing to think 'foo'
is 'Foo' until I use 'git mv Foo foo'.
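
To illustrate "the same content regardless of name": git identifies a blob by the SHA-1 of a short header plus the file's bytes, so any name that reaches the same bytes yields the same id. A minimal sketch, assuming no content filters or line-ending conversion are configured; the function name `git_blob_id` is made up here.

```python
import hashlib


def git_blob_id(path):
    """Return the SHA-1 object id git would assign to this file's raw content."""
    with open(path, "rb") as handle:
        data = handle.read()
    # A blob is hashed as: b"blob <size>\0" followed by the content bytes.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()
```

Two hardlinks to one inode, or 'foo' and 'Foo' on a case-insensitive volume, produce the same id. What the id cannot answer is which name should be recorded in the tree, which is the point under dispute.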

This is why case-insensitivity is so hard: you have a very real
"aliasing" on the filesystem level, where all those really *different*
pathnames end up being the same thing.

I don't see that as being a problem. Think of it, if you will, as if  
every single file simply had an implicit hardlink for every possible  
case or normalization variant. The whole point of the filename is that  
it is meta-information, used as an identifier and not as actual  
content, and thus it is perfectly fine for it to be a real string,  
subject to interpretation, rather than treated as a sacred binary blob  
like content is. The whole purpose of the name is to identify the  
inode in question, and case and normalization aren't particularly  
relevant here. As long as we can identify the file, we're happy.

And all the same issues show up with utf-8 rewriting, so if you
normalize utf-8 names, you actually end up having almost all the same
problems that a case-insensitive filesystem has. They're just much rarer
in practice, so you just won't hit them as often - but when you do, they
are equally painful!

(In fact, they can be a whole lot *more* painful, because now they are
really rare, and really confusing when they happen!)
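
A small illustration of that aliasing, using Python's standard `unicodedata` module. Standard NFD is used here as an approximation of the decomposed form HFS+ stores, which is not exactly the same; the helper names `utf8_forms` and `same_after_normalizing` are hypothetical.

```python
import unicodedata


def utf8_forms(name):
    """Return the UTF-8 bytes of `name` in composed (NFC) and decomposed (NFD) form."""
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    return nfc.encode("utf-8"), nfd.encode("utf-8")


def same_after_normalizing(a, b):
    """True if two names differ at most in Unicode normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    composed, decomposed = utf8_forms("caf\u00e9")
    print(composed)    # precomposed e-acute: one code point, two bytes
    print(decomposed)  # plain 'e' followed by a combining acute accent
    print(composed == decomposed)
```

A tool that compares names byte for byte sees two different files, while a normalizing filesystem sees one. That mismatch is the same kind of alias that 'xyz' and 'Xyz' form under case folding.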

But if you come from a case-insensitive background, all the UTF-8
rewriting really looks like such a small problem compared to all the
horrid problems that you had with different locales and cases, so I
suspect they didn't even realize what a big mistake they made!

Again, as someone who grew up in a case-insensitive world, there are no
problems here. I wish I could tell you that it causes problems, I wish
I could agree with you, but I can't.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
