Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Mon, 21 Jan 2008 12:25:52 -0500

On Jan 21, 2008, at 12:08 PM, Nicolas Pitre wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:

On Jan 21, 2008, at 9:14 AM, Peter Karlsson wrote:

I happen to prefer the text-as-string-of-characters (or code points,
since you use the other meaning of characters in your posts), since I
come from the text world, having worked a lot on Unicode text
processing.

You apparently prefer the text-as-sequence-of-octets, which I tend to
dislike because I would have thought computer engineers would have
evolved beyond this when we left the 1900s.

I agree. Every single problem that I can recall Linus bringing up as a
consequence of HFS+ treating filenames as strings is in fact only a
problem if you then think of the filename as octets at some point. If
you stick with UTF-8 equivalence comparison the entire time, then
everything just works.

Granted, this is a problem when you have to operate on a filesystem
that thinks of filenames as octets, but as I said before, this doesn't
mean the HFS+ approach is wrong, it just means it's incompatible with
Linus's approach.
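
To make the octets-versus-text distinction concrete, here is a minimal
Python sketch. It is not from the thread: the file name and the helper
canonically_equal are invented for illustration, and NFC is used only as
one convenient way to test canonical equivalence, not as a description
of what HFS+ or git does internally.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two names under Unicode canonical equivalence (via NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Hypothetical file name, spelled two ways that render identically.
composed = "M\u00e4rchen.txt"     # 'ä' as a single code point (U+00E4)
decomposed = "Ma\u0308rchen.txt"  # 'a' followed by combining diaeresis (U+0308)

# Octet view: the UTF-8 encodings differ, so a byte comparison says "different".
octets_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# Text view: the two spellings are canonically equivalent.
text_equal = canonically_equal(composed, decomposed)

print(octets_equal, text_equal)  # False True
```

A tool that compares the raw encodings sees two unrelated names here,
while one that compares under equivalence sees a single name.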

Linus' approach is _FAST_.

Why do you think Git has now acquired a reputation of kicking asses all
around the SCM scene?

The HFS+ approach might be fine if you think of it in terms of "the user
will be awfully confused if two file names are shown identically in the
File Open dialog box".  But it otherwise sucks big time when it comes to
high performance applications needing to deal with a huge amount of file
names at once.

Normalization will always hurt performance.  This is an overhead.
Sometimes that overhead might be insignificant and not be perceptible,
but sometimes it is.  And Git is clearly in the latter case.  Performance
will be hurt big time the day it is made aware of that normalization.
This is why there is so much resistance to it, especially when the
benefits of normalizing file names are not shown to be worth their cost
in performance and complexity, as other systems do rather fine without
it.

I agree, Linus's approach is indeed fast. And if speed is more  
important than treating filenames as text instead of octets, then so  
be it. This is a trade-off. But a trade-off doesn't mean one approach  
is "wrong", it just means the authors of HFS+ thought it was an  
acceptable trade-off. HFS+ wasn't designed to be a high-performance  
filesystem that deals with lots of files, it was designed to be a  
filesystem used by regular people on the Mac, and I believe treating  
filenames as text is a good choice in this scenario. Unfortunately,  
this does mean git has to do extra work to behave correctly on this  
system.
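
As a rough illustration of what that extra work could look like, here is
a small Python sketch. It is purely hypothetical and says nothing about
git's actual internals: build_index, find_tracked and the file names are
made up, and NFC is an assumed choice of normal form. The idea is to key
the lookup on a normalized name, so that a decomposed name coming back
from the filesystem still matches the composed name that was tracked.

```python
import unicodedata

def build_index(tracked_names):
    """Map each NFC-normalized name to the exact name as it was tracked."""
    return {unicodedata.normalize("NFC", n): n for n in tracked_names}

def find_tracked(index, name_from_disk):
    """Return the tracked name matching an on-disk name, or None."""
    # Every lookup pays for one normalization pass over the name.
    return index.get(unicodedata.normalize("NFC", name_from_disk))

# Hypothetical data: the name was tracked in composed form...
tracked = ["M\u00e4rchen.txt", "README"]
index = build_index(tracked)

# ...but the filesystem hands back the decomposed spelling.
on_disk = "Ma\u0308rchen.txt"

raw_hit = on_disk in tracked           # exact comparison misses
match = find_tracked(index, on_disk)   # normalized lookup finds the tracked name

print(raw_hit, match == tracked[0])  # False True
```

The per-name normalization call in find_tracked is the kind of overhead
Nicolas is talking about; whether it matters in practice depends on how
many names are processed and how often.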

Now, to move on to actually coming up with a solution. Unfortunately I  
don't know enough about the internals of git to really evaluate the  
proposed ideas myself, or to write a patch. Hopefully I'll find the  
time to acquire the necessary knowledge, but until then I can only  
participate in these higher-level discussions.

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
