Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Tue, 22 Jan 2008 19:38:04 -0500

On Jan 22, 2008, at 7:08 PM, Theodore Tso wrote:

On Tue, Jan 22, 2008 at 08:34:27AM -0500, Theodore Tso wrote:
	* Documenting HFS+'s current pseudo-normalization algorithm.
	  It's not enough to say that you need to decompose all
	  Unicode characters, since you've claimed that HFS+ doesn't
	  decompose Unicode characters after some magic date,
	  presumably roughly 9 years ago.

I did some research on this point, since if we really are going to be
compatible with MacOS X's crappy HFS+ system, we need to know what the
decomposition algorithm actually is.  Turns out, there are *two* of
them.  Kevin didn't know what he was talking about.  In fact,
different versions of Mac OS X use different normalization algorithms.

Mac OS 8.1 through 10.2.x used decompositions based on Unicode 2.1.
Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]

As I correctly predicted, Apple is changing their normalization
algorithm in different versions of Mac OS X.  It is not static, which
means there will be compatibility problems when moving hard drives
between Mac OS X versions.  I don't know if they try to fix this in
their fsck or not, when upgrading from 10.2 to 10.3, but if not,
certain files could disappear as part of the Mac OS X upgrade.  Fun
fun fun.

And clearly Kevin didn't read the tech note very carefully, since it
clearly admits why they did it.  The Mac OS X developers were being
cheesy with how they implemented their HFS B-tree algorithms, and took
the cheap, easy way out.  So yeah, "crappy" is the only word that can
be used for what Mac OS X perpetrated on the world.  Because of that,
a quick Google search shows it causes problems all over the stack, for
many different programs beyond just git, including limewire and
gnutella[2][3], Slim[4], and no doubt others.

[1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
[3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html
[4] http://forums.slimdevices.com/showthread.php?t=40582

In any case, it seems pretty clear that by now everyone except Kevin
has realized that HFS+ is crappy and causes Internet-wide
interoperability problems.  So I'll justify sending this note by
pointing out that the specific table for Mac OS's filesystem corruption
algorithm can be found here:

	  http://developer.apple.com/technotes/tn/tn1150table.html

I'd also recommend that the Mac OS X code try to either figure out
whether it is running on an HFS+ partition, or let the HFS+ workaround
code be something that can be controlled via .git/config.  It
shouldn't be on unconditionally even on a Mac OS X system, since if
the git repository is on a ZFS or NFS filesystem, there's no reason to
pay the overhead of working around the HFS+ bugs.

I just finished talking to one of the HFS+ developers, so I suspect I  
know a lot more on this subject now than you do. Here's some of the  
relevant information:

* Any new characters added to Unicode will only have one form  
(decomposed), so HFS+ will always accept new characters as they will  
be NFD. The only exception is case-sensitivity, as the case-folding  
tables in HFS+ are static, so new characters with case variants will  
be treated in a case-sensitive manner. However, as they are already  
decomposed, the NFD algorithm will not change their encoding. This  
means that no, there are zero problems moving HFS+ drives between  
versions of OS X.

* At the time HFS+ was developed, there was no one common standard for  
normalization. The HFS+ developers picked NFD because they thought it  
was "a more flexible, future-looking form", but Microsoft ended up  
picking the opposite just a short time later. Interestingly, NFC is a  
weird hybrid form which only has composed forms for pre-existing  
characters, and decomposed forms for all new characters (as they only  
have one form). So in a sense NFD is more sane than NFC.

* The core issue here, which is why you think HFS+ is so stupid, is  
that you guys see no problem with having 2 files "Märchen" (NFC) and  
"Märchen" (NFD), whereas the HFS+ developers don't consider it  
acceptable to have 2 visually identical names as independent files.  
Unfortunately, the only way to do this matching is to store the  
normalized form in the filesystem, because it would be a performance  
nightmare to try and do this matching any other way. The HFS+  
developers considered it an acceptable trade-off, and as an  
application developer I tend to agree with them.
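
To make the two "Märchen" names concrete, here is a minimal sketch using Python's standard unicodedata module. The helper name same_name is hypothetical, and comparing after NFD normalization only illustrates the idea of matching on a normalized form; it is not a description of how HFS+ itself does the lookup.

```python
import unicodedata

# The same visible name in the two normalization forms discussed above.
nfc = unicodedata.normalize("NFC", "M\u00e4rchen")  # precomposed U+00E4
nfd = unicodedata.normalize("NFD", "M\u00e4rchen")  # "a" followed by U+0308


def same_name(a, b):
    """Hypothetical helper: compare two names after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(nfc == nfd)            # False: the code point sequences differ
print(len(nfc), len(nfd))    # 7 8
print(nfc.encode("utf-8"))   # byte sequence a byte-oriented filesystem would store
print(nfd.encode("utf-8"))   # a different byte sequence for the same visible name
print(same_name(nfc, nfd))   # True: equal once both are normalized
```

A filesystem that compares raw bytes treats these as two distinct files, while one that normalizes on storage treats them as one.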

As I have stated in the past, this isn't a case of HFS+ being stupid  
and causing problems, it's a case of HFS+ being *different* and  
causing problems. But this difference is just as much your fault as it  
is HFS+'s fault.

* For detecting case-sensitive filesystems you can use pathconf(2):  
_PC_CASE_SENSITIVE (if unsupported, you can assume the filesystem is  
case-sensitive). There is also the getattrlist(2) attribute:  
VOL_CAP_FMT_CASE_SENSITIVE.
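
A rough sketch of the pathconf(2) check, written in Python for brevity. The numeric value 11 for _PC_CASE_SENSITIVE is an assumption taken from the macOS <sys/unistd.h> header and should be verified against the system headers; the function name is_case_sensitive is hypothetical. On other platforms, or if the query fails, it falls back to assuming case sensitivity as described above.

```python
import os
import sys

# Assumption: value of _PC_CASE_SENSITIVE in the macOS <sys/unistd.h> header.
# The raw integer is passed because the symbolic name may not be available
# through os.pathconf_names.
PC_CASE_SENSITIVE_DARWIN = 11


def is_case_sensitive(path):
    """Hypothetical helper: True if the filesystem holding path is case-sensitive."""
    if sys.platform != "darwin":
        # The selector is macOS-specific; elsewhere assume case-sensitive.
        return True
    try:
        # 0 means case-insensitive. Any other result, including -1 for an
        # unsupported query, is treated as case-sensitive.
        return os.pathconf(path, PC_CASE_SENSITIVE_DARWIN) != 0
    except (OSError, ValueError):
        return True


print(is_case_sensitive(os.getcwd()))
```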

There appears to be no API for determining if normalization will be  
applied. However, any filesystem that uses UTF-8 explicitly as storage  
(unlike the Linux filesystems, which you claim use UTF-8 but which  
obviously use no encoding at all) is pretty much guaranteed to  
have to normalize or it will have abysmal performance.

I must say it is shocking that someone as smart as you is still more  
interested in finding ways to prove me wrong than to actually address  
the problem. It's obvious that the only research you did was intended  
to find ways to call me stupid.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
