Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Mon, 21 Jan 2008 15:59:51 -0500

Note: resent to list due to bounce.
Original CC list: tytso@xxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, peter@xxxxxxxxxxxxxxxx,
mjscod@xxxxxx, melo@xxxxxxxxxxxxxxxx

On Jan 21, 2008, at 3:46 PM, Theodore Tso wrote:

On Mon, Jan 21, 2008 at 03:31:02PM -0500, Kevin Ballard wrote:
No, it's still broken, because of the Unicode-is-not-static problem.
What happens when you start adding more composable characters, which
some future version of HFS+ will start breaking apart?

If you need a static representation, you normalize to a specific form. And
in fact, adding new composable characters doesn't matter, since if they
didn't exist before, you couldn't have possibly used them.
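
For reference, a minimal sketch of what "normalizing to a specific form" means for a single accented character, using Python's standard unicodedata module. It assumes the standard Unicode NFD and NFC forms purely for illustration; the decomposition HFS+ applies is a frozen variant and may not match this module's output for every character.

```python
import unicodedata

# The same visible character "é", spelled two different ways.
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Normalizing to a specific form maps both spellings to one representation.
nfd = unicodedata.normalize("NFD", composed)    # decompose
nfc = unicodedata.normalize("NFC", decomposed)  # compose

# The raw strings (and their UTF-8 bytes) differ until they are normalized.
print(composed == decomposed)
print(nfd == decomposed, nfc == composed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```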

Sure you can.  Suppose you unpack the same tar file or zip file that
contains one of these new-fangled characters, one on a MacOS 10.5
system, and one on a MacOS 10.9 system.  How HFS+ will corrupt that
filename will differ depending which version of MacOS you are running.
Hence, normalizing the filename when you store it is stupid and
broken.  If MacOS and its applications and libraries want to do
normalization in the privacy of its own address space, that's its
business.  It can pursue any fetish it wants, among consenting adults.
Safe, sane and consensual, and all that... well, consensual, anyway.
I'm not sure about "safe" and "sane"....

You're making the huge assumption that the HFS+ normalization  
algorithms will change. As the technote states:

"Platform algorithms tend to evolve with the Unicode standard. The HFS  
Plus algorithms cannot evolve because such evolution would invalidate  
existing HFS Plus volumes."

My argument is basically that there is absolutely no value in what
HFS+ is doing, by corrupting filenames --- if you want to call it
"normalizing" them, fine, but since Unicode is not static, you
can't even call it a "canonical" form.  It's just some random
corruption of what was passed in at open(2) time, that can and will
change depending on what version of MacOS you are running.

Again with the huge assumptions.

If you want to play the insane Unicode game of "equivalent"
characters, you have to do it at comparison time, so there's no point
trying to "normalize" them when you store them.  It doesn't buy you
anything, and it causes all sorts of pain.

It must have bought somebody something, or they never would have done it.
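
To make the "do it at comparison time" alternative from the quoted paragraph concrete, here is a rough sketch in Python. The names same_name and find are hypothetical helpers invented for this illustration, and standard NFD is assumed as the comparison form; this is not how HFS+ or any particular filesystem is implemented.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two names as canonically equivalent, without altering either."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def find(stored_names, wanted):
    """Return the stored name equivalent to `wanted`, exactly as it was stored."""
    for name in stored_names:
        if same_name(name, wanted):
            return name
    return None

# The name is kept in the form the caller supplied (composed here).
stored = ["caf\u00e9.txt"]

# A lookup using the decomposed spelling still finds it,
# and the stored form comes back unchanged.
hit = find(stored, "cafe\u0301.txt")
print(hit == stored[0])
```

With this arrangement the stored name is never rewritten; only the comparison depends on the normalization tables in use at lookup time.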

Your entire argument is based on the assumption that HFS+ "corrupts"
filenames in order to allow dumb clients to do byte comparisons, and I
don't believe that to be the case.

OK, what's your reason for why HFS+ corrupts filenames?  What do you
think is its excuse?  What problem does it solve?  If the answer is
"no reason at all, but because it *can*", according to the Great God
Unicode, then that's really not very impressive....

I have no idea why HFS+ stores filenames in a normalized form, and  
further I am smart enough to know that speculating is completely  
pointless. I assume the authors had a good reason (which should be a  
safe assumption, since filesystem authors are a smart bunch). The reason may  
not be valid anymore, but if it was valid back in 1998, then I can  
accept it without complaining.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
