Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Wed, 16 Jan 2008 15:39:36 -0500

On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:

More like, Mac OS X has standardized on Unicode and the rest of the
world hasn't caught up yet. Git is the only tool I've ever heard of
that has a problem with OS X using Unicode.

No.  That's not at all the problem.  Mac OS X insists on storing
_another_ encoding of your filename.  Both are UTF-8.  Both encode the
_same_ string.  Yet they are different, bytewise.  For no good reason.

To be more exact, the encoding used to _create_ a file differs from the
encoding returned when _reading the directory_...

Stop spreading FUD.  Git can handle Unicode just fine.  In fact, Git
does not _care_ how the filename is encoded, it _respects_ the user's
choice, not only of the encoding _type_, but the _encoding_, too.

...which means that the sequences of bytes differ. And Git by design is
(both for filenames and for blob contents) encoding agnostic.

HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
filesystems (e.g. case insensitive filesystems) well.

There are two different ways to do filesystem encodings. One is to have
the fs simply not care about encoding, which is what the Linux world
seems to prefer. Sure, this is great in that what you create the file
with is what you get back, but on the other hand, given an arbitrary  
non-ASCII file on disk, you have absolutely no idea what the encoding  
should be and you can't display it without making assumptions (yes you  
can use heuristics, but you're still making assumptions). Filesystems  
like HFS+ that standardize the encoding, on the other hand, make it  
such that you always know what the encoding of a file should be, so  
you can always display and use the filename intelligently. It also  
means it plays much nicer in a non-ASCII world, since you don't have  
to worry about different normalizations of a given string referring to  
different files (it's one thing to be case-sensitive, but claiming  
that "föo" and "föo" are different files just because one uses a  
composed character and the other doesn't is extremely
user-unfriendly). On the other hand, what you create the file with may not
be what you read back later, since the name has been standardized.  
It's hard to say one is better than the other, they're just different  
ways of doing it. However, I have noticed that everybody who's voiced
an opinion on this list in favor of the encoding-agnostic approach
seems to be unwilling to accept that any other approach might have
validity, to the extent of calling an OS/filesystem that does things
differently stupid or insane. This strikes me as extremely elitist and
risks alienating what I expect to be a fast-growing group of users  
(i.e. OS X users).
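
To make the composed/decomposed distinction concrete, here is a rough
sketch in Python. It only illustrates the string-level issue; it does
not touch any filesystem, and it assumes standard NFD is a close enough
stand-in for the decomposed form HFS+ stores. The helper name same_name
is made up for illustration.

```python
import unicodedata

# "föo" with 'ö' as a single precomposed code point (U+00F6).
composed = "f\u00f6o"

# The same string with 'ö' split into 'o' + combining diaeresis (U+0308).
# Assumption: standard NFD approximates what HFS+ hands back.
decomposed = unicodedata.normalize("NFD", composed)

# Both are valid UTF-8, but the byte sequences differ.
composed_bytes = composed.encode("utf-8")      # b'f\xc3\xb6o'
decomposed_bytes = decomposed.encode("utf-8")  # b'fo\xcc\x88o'


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed_bytes == decomposed_bytes)  # bytewise comparison
    print(same_name(composed, decomposed))     # normalization-aware comparison
```

An encoding-agnostic tool sees only the two byte strings, so it treats
them as two names; a normalizing filesystem treats them as one.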

I'm willing to give Linus a free pass on calling other OS's stupid and  
insane, as I don't think Linux would exist as it does today without  
his strong opinions, but I don't think this should give carte blanche  
to the rest of the community for this inflammatory behavior.

I should note that I'm only taking the time to discuss this because,  
despite the fact that I'm new to git, I really like it and I want it  
to work better. And one area that it has a problem with is the
de facto filesystem on my OS of choice. However, attempts to discuss the
problem invariably end up with multiple people calling my OS stupid
and insane simply because it differs in a particular design decision.  
This is not a good way to build a community or to build a better  
product, and I hope it can be improved.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
