Re: git on MacOSX and files with decomposed utf-8 file names

Dmitry Potapov <dpotapov@xxxxxxxxx> · Thu, 17 Jan 2008 02:52:58 +0300

On Wed, Jan 16, 2008 at 03:39:36PM -0500, Kevin Ballard wrote:
> On Jan 16, 2008, at 11:46 AM, Jakub Narebski wrote:
> 
> >
> >HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
> >filesystems (e.g. case insensitive filesystems) well.
> 
> There's two different ways to do filesystem encodings. One is to have  
> the fs simply not care about encoding, which is what the linux world  
> seems to prefer. 

There is no technical reason for the *kernel* to care about file name
encoding. It is something that can and should be dealt with in
user space (except for some special cases like smbfs).

> Sure, this is great in that what you create the file  
> with is what you get back,

And also because a user space program can deal with it much more
gracefully...

> but on the other hand, given an arbitrary  
> non-ASCII file on disk, you have absolutely no idea what the encoding  
> should be and you can't display it without making assumptions (yes you  
> can use heuristics, but you're still making assumptions).

Wrong. If you have a policy that all file names are stored in UTF-8,
then there is no problem here. Encoding should not be the kernel's
problem; besides, you cannot fully solve it in kernel space
anyway...
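
A minimal sketch of how such a policy could be checked in user space,
assuming Python 3. The helper names is_valid_utf8 and non_utf8_names
are hypothetical and only illustrate the idea; nothing here is part of
Git or of any kernel interface.

```python
import os

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw file name bytes decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def non_utf8_names(directory: bytes = b"."):
    """List directory entries whose names violate the UTF-8 policy.

    Passing a bytes path makes os.listdir() return the raw bytes the
    kernel stored, with no decoding applied.
    """
    return [raw for raw in os.listdir(directory) if not is_valid_utf8(raw)]

if __name__ == "__main__":
    for raw in non_utf8_names():
        print("not UTF-8:", raw)
```

The kernel only ever sees bytes here; whether they form valid UTF-8 is
decided entirely by the program.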

> Filesystems  
> like HFS+ that standardize the encoding,

Yeah, right... Like Microsoft likes to "standardize" everything, which
in practice means forcing on others something that is fundamentally
broken and does not follow any existing standard precisely:

===
IMPORTANT:
The terms used in this Q&A, decomposed and precomposed, roughly
correspond to Unicode Normal Forms D and C, respectively. However, most
volume formats do not follow the exact specification for these normal
forms.
===
http://developer.apple.com/qa/qa2001/qa1173.html

Not to mention that the use of decomposed Unicode as the standard is
outright silly -- no sane person writes in "decomposed" Unicode...

> on the other hand, make it  
> such that you always know what the encoding of a file should be, so  
> you can always display and use the filename intelligently.

Somehow I have no problem with displaying non-ASCII names on Linux.
I can see symbols encoded in both Unicode Normal Forms C and D without
any problem, though the kernel is completely unaware of them.

> It also  
> means it plays much nicer in a non-ASCII world, since you don't have  
> to worry about different normalizations of a given string referring to  
> different files (it's one thing to be case-sensitive, but claiming  
> that "föo" and "föo" are different files

As you typed them, they both are exactly the same, and both of them are
in Normal Form C (which Mac calls precomposed). So why do you use one
encoding in your writing and another in your file names?
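
To make the distinction concrete, here is a small sketch, assuming
Python 3 and its standard unicodedata module. The function same_name
is a hypothetical helper used only for illustration.

```python
import unicodedata

precomposed = "f\u00f6o"   # 'ö' as the single code point U+00F6 (NFC)
decomposed = "fo\u0308o"   # 'o' followed by combining diaeresis U+0308 (NFD)

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)           # False: different code points
print(same_name(precomposed, decomposed))  # True: equal after normalization
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))     # 4 5: different bytes on disk
```

Both strings render identically, but they are different byte sequences,
so a byte-preserving filesystem treats them as two distinct names.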

> just because one uses a  
> composed character and the other doesn't is extremely user- 
> unfriendly). On the other hand, what you create the file with may not  
> be what you read back later, since the name has been standardized.  
> It's hard to say one is better than the other, they're just different  
> ways of doing it. However, I have noticed that everybody who's voiced  
> an opinion on this list in favor of the encoding-agnostic approach  
> seem to be unwilling to accept that any other approach might have  
> validity, to the extent of calling an OS/filesystem that does things  
> different stupid or insane. This strikes me as extremely elitist and  
> risks alienating what I expect to be a fast-growing group of users  
> (i.e. OS X users).

I am sure everyone here is scared to death... I mean, we are used to
hearing such threats from some MS salespeople, but from a Mac guy? It
is really scary....

Wake up, and stop shooting this nonsense at us. If you have technical
reasons why your solution is better, let us know. So far, you do not
sound very convincing here. Why do you think that the issue of encoding
cannot be dealt with in user space? Why does Mac OS X use so-called
decomposed Unicode, which does not even follow any standard precisely?
Why did Mac OS X choose to decompose characters when that does not
solve any real issue?

> And one area that it has a problem with is the de- 
> facto filesystem on my OS of choice.

I suppose that would be a much better subject for discussion...
At least, it would be more likely to result in Git working
better on your OS.

> However, attempts to discuss the  
> problem invariable end up with multiple people calling my OS stupid  
> and insane simply because it differs in a particular design decision.  

First, no one called Mac OS X insane, only case-insensitive
filesystems, and there are good reasons to think so: no one has so far
demonstrated any advantage of that approach, while the disadvantages
are quite obvious to anyone. Comparing a stored file list with
readdir() is much more problematic, and you cannot say that you have
solved the encoding problem if you force other people to *duplicate*
some of the logic that Mac OS X implements in its kernel just to get
things working... So no one thinks it is insane because it is
different, but because it requires much more effort to do the same
thing -- compare two file lists -- and this operation is important for
Git to work properly...
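
As a rough illustration of the duplicated logic, the sketch below
compares a stored list of names with a directory listing after
normalizing both sides. It assumes Python 3 and uses NFC as the common
form; as the Apple Q&A quoted above notes, HFS+ decomposition does not
follow the normal forms exactly, so this is an approximation and not
what Git itself does. The names diff_names and compare_with_directory
are hypothetical.

```python
import os
import unicodedata

def nfc(name: str) -> str:
    """Normalize a file name to Unicode Normal Form C."""
    return unicodedata.normalize("NFC", name)

def diff_names(stored_names, on_disk_names):
    """Return (missing, untracked), comparing names in normalized form.

    missing:   stored names with no matching entry on disk
    untracked: on-disk names with no matching stored entry
    """
    stored = {nfc(n): n for n in stored_names}
    on_disk = {nfc(n): n for n in on_disk_names}
    missing = sorted(stored[k] for k in stored.keys() - on_disk.keys())
    untracked = sorted(on_disk[k] for k in on_disk.keys() - stored.keys())
    return missing, untracked

def compare_with_directory(stored_names, directory="."):
    """Compare a stored list against what the filesystem reports."""
    return diff_names(stored_names, os.listdir(directory))
```

Without the nfc() step, a name stored in precomposed form and read
back in decomposed form would show up as both missing and untracked.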

Dmitry