On Jan 21, 2008, at 3:15 PM, Theodore Tso wrote:
> On Mon, Jan 21, 2008 at 03:01:43PM -0500, Kevin Ballard wrote:
>> You seem to be under the impression that I'm advocating that git treat all filenames as Unicode strings, and thus change its hashing algorithm as described. I am not. I am saying that, if git only had to deal with HFS+, then it could treat all filenames as strings, etc. However, since git does not only have to deal with HFS+, this will not work. What I am describing is an ideal, not a practicality.
>
> Well, why are you arguing on the git list about precisely that (when you responded to Linus), then?
Because of the way in which an argument evolves. This started out as "HFS+ is stupid because it normalizes", and I was arguing that said normalization wasn't stupid. That turned into an argument about why HFS+ isn't stupid for normalizing, which is basically this argument about the ideal. Yes, I realize it's not producing any practical results, but I'm stubborn (as, apparently, are most of you), and I believe that if the official stance of the git project is "HFS+ is stupid", then there's a lower chance of a patch being accepted than if people accept that "HFS+ is different in an incompatible fashion".
>> In other words, what I'm saying is that treating filenames as strings works perfectly fine, *provided you can do that 100% of the time*. git cannot do that 100% of the time, therefore it's not appropriate here. The purpose of this argument is to illustrate that treating filenames as strings isn't wrong; it's simply incompatible with treating filenames as byte sequences.
>
> No, it's still broken, because of the Unicode-is-not-static problem. What happens when you start adding more composable characters, which some future version of HFS+ will start breaking apart?
If you need a static representation, you normalize to a specific form. And in fact, adding new composable characters doesn't matter, since if they didn't exist before, you couldn't possibly have used them. Unless you mean adding new precomposed forms of existing simpler characters, at which point you seem to be arguing for NFD instead of NFC.
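To make the NFC/NFD distinction concrete, here is a small illustrative sketch (not from the thread) using Python's standard `unicodedata` module. It shows that "é" has two representations that are equal as normalized strings but differ as byte sequences, which is exactly what a byte-oriented tool like git sees:

```python
import unicodedata

# "é" can be written as one precomposed code point (NFC)
# or as a base letter plus a combining accent (NFD).
nfc = unicodedata.normalize("NFC", "e\u0301")   # -> U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")    # -> U+0065 U+0301

# Normalized to a common form, the two are the same string...
assert unicodedata.normalize("NFD", nfc) == nfd

# ...but their UTF-8 byte sequences differ.
print(nfc.encode("utf-8"))  # b'\xc3\xa9'
print(nfd.encode("utf-8"))  # b'e\xcc\x81'
```

A filesystem that normalizes (like HFS+) collapses these two spellings into one; a tool that hashes raw bytes (like git) treats them as two different names.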
> Presumably the whole *reason* why HFS+ was corrupting strings was so that "stupid applications" that only did byte comparisons would work correctly. But when you upgrade from Mac OS 10.5 to 10.6, and it adds support for new composable characters, and you now take a USB hard drive that was hooked up to a MacBook Air, running one version of MacOS, and hook it up to another Macintosh, running another version of MacOS, the normalization algorithm will be different, so the byte comparisons won't work.
I doubt that HFS+ normalized so that "stupid applications" could do byte comparisons. But even if that were the case, see previous paragraph.
> So all of this extra work which MacOS put in to corrupt filenames behind our back doesn't actually do any good; applications still need to be smart, or there will be rare, hard to reproduce bugs nevertheless. So if MacOS wants to supply Unicode libraries that compare strings keeping in mind Unicode "equivalences" it can be our guest (although how they deal with different versions of Unicode with different equivalence classes will be their cross to bear). BUT MacOS X SHOULD NOT BE CORRUPTING FILENAMES. TO DO SO IS BROKEN.
Your entire argument is based on the assumption that HFS+ "corrupts" filenames in order to allow dumb clients to do byte comparisons, and I don't believe that to be the case. In fact, it's only considered a corruption if you care about the byte sequence of filenames, and my argument is that, on HFS+, you aren't supposed to care about the byte sequence.
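The kind of normalization-aware comparison being argued about here can be sketched in a few lines. This is a hypothetical helper for illustration only, not git code or an Apple API; it uses plain NFD, whereas HFS+ actually uses its own fixed decomposition (based on an old Unicode version), so treat this as an approximation of the idea:

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare two filenames up to Unicode normalization.

    Hypothetical helper: normalizes both sides to NFD before
    comparing, roughly the "treat filenames as strings" model
    under discussion, rather than comparing raw bytes.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Byte-wise these two spellings of "café" differ, but as
# HFS+-style strings they name the same file.
assert same_filename("caf\u00e9", "cafe\u0301")
assert not same_filename("cafe", "caf\u00e9")
```

The crux of the disagreement is which side of this comparison is responsible for the normalization: the filesystem (HFS+'s position) or the application (git's position).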
-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com