Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 16, 2008, at 11:08 PM, Linus Torvalds wrote:

On Wed, 16 Jan 2008, Kevin Ballard wrote:

I believe it exists because HFS+ was created at a time when the Mac was moving from a multi-encoding world (which was a nightmare) to a Unicode world and they wanted to remove ambiguity in filenames. But I wasn't around when they made this decision, so this is just a guess.

I do agree. And I think starting out case-insensitive (something they must
really hate by now) also made it less of an issue. When you're
case-insensitive, the issues with any UTF-8 normalization are simply
swamped by all the issues of case, so you probably don't even think about
it very much.

Those of us who grew up on a case-insensitive filesystem don't find there to be any problem with it. I can count on one hand the number of times I've run into a problem caused by a case-insensitive filesystem. That number is 1. And that 1 time is when git screwed up trying to track CS4536 and cs4536 in the same directory (see earlier thread).

The big problem with any name rewriting is that I can open file 'xyz', and
I literally have a very hard time knowing whether that file I know I
opened and created has anything to do with the file 'Xyz' that I see when
I do a readdir().

That's only true if you don't know what type of filesystem you're on. And, in the vast majority of cases (in fact, a content tracker is the only exception I can think of), it doesn't matter. If the user said 'xyz' and you can stat() it, great, that's what the user wanted! Just because it's really called 'Xyz' on the filesystem doesn't make any difference.

Are they the same? Maybe. But it's literally hard to tell on OS X. I can
do an fstat() on my file descriptor and on the directory entry, and if
they get the same d_ino they *probably* are the same entry, but even then
it actually could have been a hardlink (and my 'xyz' is really *another*
name for it entirely, and the filesystem is actually case-sensitive and
'Xyz' was a *different* name that somebody else did!).

See? If you're creating a content tracker, these kinds of issues are not "idle chatter". It's really *really* important. Was that file the one I was told to track? Or was it a temporary file that was just hardlinked?
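
To make the inode comparison described above concrete, here is a minimal Python sketch. It assumes a POSIX-like system; the helper names `names_sharing_inode` and `demo`, and the file name `tempcopy`, are made up for illustration and are not taken from git. It shows that once a hard link exists, matching inode numbers can no longer tell you which directory entry is "your" name.

```python
import os
import tempfile


def names_sharing_inode(fd, directory):
    """Return the names in `directory` whose inode matches the open descriptor."""
    target = os.fstat(fd)
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            # Compare device and inode: inode numbers are only unique per device.
            if (info.st_dev, info.st_ino) == (target.st_dev, target.st_ino):
                matches.append(entry.name)
    return sorted(matches)


def demo():
    """Create 'xyz', then hard-link it, and list matching names each time."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "xyz")
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
        try:
            before = names_sharing_inode(fd, tmp)
            # A second, unrelated name for the same inode (hypothetical temp file).
            os.link(path, os.path.join(tmp, "tempcopy"))
            after = names_sharing_inode(fd, tmp)
        finally:
            os.close(fd)
        return before, after
```

Calling `demo()` returns the matching names before and after the link is made. After the link there are two candidates, and nothing in the inode data says which one the caller originally asked for.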

But git is a content tracker, so even if it's really a different hardlink, that shouldn't matter: it's still referencing the same content. Go ahead and track whatever name the user specified originally; as long as it maps to a file on disk with the expected content, you're set. If the file is really called 'foo' and I told git to track 'Foo', I'm perfectly happy with it continuing to think 'foo' is 'Foo' until I use 'git mv Foo foo'.

This is why case-insensitivity is so hard: you have a very real "aliasing" on the filesystem level, where all those really *different* pathnames end
up being the same thing.

I don't see that as being a problem. Think of it, if you will, as if every single file simply had an implicit hardlink for every possible case or normalization variant. The whole point of the filename is that it is meta-information, used as an identifier and not as actual content, and thus it is perfectly fine for it to be a real string, subject to interpretation, rather than treated as a sacred binary blob like content is. The whole purpose of the name is to identify the inode in question, and case and normalization aren't particularly relevant here. As long as we can identify the file, we're happy.
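
One way to picture the "implicit hardlink for every variant" idea is as a lookup keyed on a folded form of the name. The Python sketch below is only an illustration: `alias_key` and `find_tracked` are hypothetical helpers, and the folding rule (NFD plus `str.casefold()`) is an assumption, not the exact comparison HFS+ or git performs.

```python
import unicodedata


def alias_key(name):
    """Fold away case and normalization differences (illustrative rule only)."""
    decomposed = unicodedata.normalize("NFD", name)
    # Re-normalize after case folding, since folding can produce new sequences.
    return unicodedata.normalize("NFD", decomposed.casefold())


def find_tracked(tracked_names, on_disk_name):
    """Return the tracked spelling that aliases `on_disk_name`, or None."""
    key = alias_key(on_disk_name)
    for name in tracked_names:
        if alias_key(name) == key:
            return name
    return None
```

With this rule, a tracker that recorded 'Foo' would still find it when the directory listing says 'foo', and would keep reporting the name as 'Foo', which matches the behaviour asked for above.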

And all the same issues show up with utf-8 rewriting, so if you normalize utf-8 names, you actually end up having almost all the same problems that a case-insensitive filesystem has. They're just much rarer in practice, so
you just won't hit them as often - but when you do, they are equally
painful!

(In fact, they can be a whole lot *more* painful, because now they are
really rare, and really confusing when they happen!)
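
The aliasing that normalization introduces can be shown with a small Python sketch. The `simulated_readdir` function is a hypothetical stand-in for a filesystem that stores names in decomposed form (assumed here to be plain NFD, which is only an approximation of what HFS+ does); `created_name_is_listed` is likewise an illustrative helper.

```python
import unicodedata

# The same visible name in two encodings.
composed = unicodedata.normalize("NFC", "caf\u00e9")    # 'é' as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' + combining acute


def simulated_readdir(created_names):
    """Hypothetical filesystem that returns every name in decomposed form."""
    return [unicodedata.normalize("NFD", n) for n in created_names]


def created_name_is_listed(created, listing):
    """Byte-exact check, as a normalization-unaware tool would do it."""
    wanted = created.encode("utf-8")
    return any(wanted == name.encode("utf-8") for name in listing)
```

A tool that creates the composed name and then looks for those exact bytes in `simulated_readdir([composed])` will not find it, even though the file is there. Pure ASCII names are unaffected, which is part of why the problem shows up so rarely.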

But if you come from a case-insensitive background, all the UTF-8
rewriting really looks like such a small problem compared to all the
horrid problems that you had with different locales and cases, so I
suspect they didn't even realize what a big mistake they did!

Again, as someone who grew up in a case-insensitive world, there are no problems here. I wish I could tell you that it causes problems, I wish I could agree with you, but I can't.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com

