Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 22, 2008, at 7:08 PM, Theodore Tso wrote:

On Tue, Jan 22, 2008 at 08:34:27AM -0500, Theodore Tso wrote:
	* Documenting HFS+'s current pseudo-normalization algorithm.
	  It's not enough to say that you need to decompose all
	  Unicode characters, since you've claimed that HFS+ doesn't
	  decompose Unicode characters after some magic date,
	  presumably roughly 9 years ago.

I did some research on this point, since if we really are going to be
compatible with MacOS X's crappy HFS+ system, we need to know what the
decomposition algorithm actually is.  Turns out, there are *two* of
them.  Kevin didn't know what he was talking about.  In fact,
different versions of Mac OS X use different normalization algorithms.

Mac OS 8.1 through Mac OS X 10.2.x used decompositions based on Unicode 2.1.
Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]

As I correctly predicted, Apple is changing their normalization
algorithm in different versions of Mac OS X.  It is not static, which
means there will be compatibility problems when moving hard drives
between Mac OS X versions.  I don't know if they try to fix this in
their fsck or not, when upgrading from 10.2 to 10.3, but if not,
certain files could disappear as part of the Mac OS X upgrade.  Fun
fun fun.

And clearly Kevin didn't read the tech note very carefully, since it
clearly admits why they did it.  The Mac OS X developers were being
cheesy with how they implemented their HFS B-tree algorithms, and took
the cheap, easy way out.  So yeah, "crappy" is the only word that can
be used for what Mac OS X perpetrated on the world.  Because of that,
a quick Google search shows it causes problems all over the stack, for
many different programs beyond just git, including limewire and
gnutella[2][3], Slim[4], and no doubt others.

[1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
[3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html
[4] http://forums.slimdevices.com/showthread.php?t=40582

In any case, it seems pretty clear that by now everyone except Kevin
has realized that HFS+ is crappy and causes Internet-wide
interoperability problems.  So I'll justify sending this note by
pointing out the specific table of Mac OS's filesystem corruption
algorithm can be found here:

	  http://developer.apple.com/technotes/tn/tn1150table.html

I'd also recommend that the Mac OS X code try to either figure out
whether it is running on an HFS+ partition, or let the HFS+ workaround
code be something that can be controlled via .git/config.  It
shouldn't be on unconditionally even on a Mac OS X system, since if
the git repository is on a ZFS or NFS filesystem, there's no reason to
pay the overhead of working around the HFS+ bugs.

I just finished talking to one of the HFS+ developers, so I suspect I know a lot more on this subject now than you do. Here's some of the relevant information:

* Any new characters added to Unicode will only have one form (decomposed), so HFS+ will always accept new characters as they will be NFD. The only exception is case-sensitivity, as the case-folding tables in HFS+ are static, so new characters with case variants will be treated in a case-sensitive manner. However, as they are already decomposed, the NFD algorithm will not change their encoding. This means that no, there are zero problems moving HFS+ drives between versions of OS X.

* At the time HFS+ was developed, there was no one common standard for normalization. The HFS+ developers picked NFD because they thought it was "a more flexible, future-looking form", but Microsoft ended up picking the opposite just a short time later. Interestingly, NFC is a weird hybrid form which only has composed forms for pre-existing characters, and decomposed forms for all new characters (as they only have one form). So in a sense NFD is more sane than NFC.

* The core issue here, which is why you think HFS+ is so stupid, is that you guys see no problem with having 2 files "Märchen" (NFC) and "Märchen" (NFD), whereas the HFS+ developers don't consider it acceptable to have 2 visually identical names as independent files. Unfortunately, the only way to do this matching is to store the normalized form in the filesystem, because it would be a performance nightmare to try and do this matching any other way. The HFS+ developers considered it an acceptable trade-off, and as an application developer I tend to agree with them.
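To make the two spellings concrete, here is a minimal Python sketch of the "Märchen" case. It uses the standard NFC and NFD forms from Python's unicodedata module; that is an assumption for illustration, since HFS+ applies its own fixed decomposition table (see the tech note), which is not guaranteed to match the standard forms for every character.

```python
import unicodedata

# Start from the precomposed spelling: U+00E4 is a single code point for "ä".
name = "M\u00e4rchen"

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form: "a" + U+0308

# The two strings render identically but differ in code points and in bytes.
print(len(nfc), len(nfd))
print(nfc.encode("utf-8"))
print(nfd.encode("utf-8"))

# A byte-for-byte comparison treats them as different names ...
print(nfc == nfd)

# ... while comparing after normalizing both sides treats them as the same.
print(unicodedata.normalize("NFD", nfc) == unicodedata.normalize("NFD", nfd))
```

A filesystem that compares raw bytes will happily store both names side by side; one that normalizes before comparing will see a single name.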

As I have stated in the past, this isn't a case of HFS+ being stupid and causing problems, it's a case of HFS+ being *different* and causing problems. But this difference is just as much your fault as it is HFS+'s fault.

* For detecting case-sensitive filesystems you can use pathconf(2): _PC_CASE_SENSITIVE (if unsupported, you can assume the filesystem is case-sensitive). There is also the getattrlist(2) attribute: VOL_CAP_FMT_CASE_SENSITIVE.

There appears to be no API for determining if normalization will be applied. However, any filesystem that uses UTF-8 explicitly as storage (unlike the Linux filesystems, which you claim use UTF-8 but which obviously really use nothing at all) is pretty much guaranteed to have to normalize or it will have abysmal performance.
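In the absence of such an API, one option is to probe the filesystem empirically. The following Python sketch is only an illustration of that idea, not something git does: it creates scratch files in a temporary directory and looks at what the filesystem reports back. The function name probe_directory and the result keys are hypothetical. It deliberately uses a probe rather than pathconf(2), since _PC_CASE_SENSITIVE is not available on every platform.

```python
import os
import tempfile
import unicodedata


def probe_directory(parent):
    """Empirically check how the filesystem holding `parent` treats names.

    Hypothetical helper: creates scratch files in a temporary subdirectory
    of `parent` and removes them again when done.
    """
    result = {}
    with tempfile.TemporaryDirectory(dir=parent) as scratch:
        # Case probe: create a mixed-case name, then look it up in lower case.
        open(os.path.join(scratch, "CaseProbe"), "w").close()
        result["case_insensitive"] = os.path.exists(
            os.path.join(scratch, "caseprobe")
        )

        # Normalization probe: create the composed (NFC) spelling ...
        nfc = unicodedata.normalize("NFC", "M\u00e4rchen")
        nfd = unicodedata.normalize("NFD", nfc)
        open(os.path.join(scratch, nfc), "w").close()

        # ... then see whether the directory listing returns it unchanged,
        # and whether the decomposed (NFD) spelling finds the same file.
        listed = os.listdir(scratch)
        result["stored_name_changed"] = nfc not in listed
        result["forms_alias"] = os.path.exists(os.path.join(scratch, nfd))
    return result


if __name__ == "__main__":
    print(probe_directory(tempfile.gettempdir()))
```

The result only describes the filesystem that holds the probed directory, so a repository on a different volume would need its own probe.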

I must say it is shocking that someone as smart as you is still more interested in finding ways to prove me wrong than to actually address the problem. It's obvious that the only research you did was intended to find ways to call me stupid.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com

