Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Wed, 23 Jan 2008 12:19:02 -0500

On Jan 23, 2008, at 11:16 AM, Linus Torvalds wrote:

> On Wed, 23 Jan 2008, Theodore Tso wrote:
>
> > So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
> > doing no normalization, as it is creating two files.  On HFS+, MacOS
> > is mapping both filenames to the same decomposed name.

> Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle
> filenames on non-HFS+ filesystems.
>
> The problem is that since most native applications *expect* that name
> mangling, they'll probably do name mangling of their own (internally) just
> to compare the names!

Well yes, any context in which a string is treated as Unicode instead
of an opaque sequence of bytes will probably lead to normalization at
some point (e.g. when searching text, I'm going to want precomposed
Märchen and decomposed Märchen to be treated as the same string). The
Mac OS X APIs use NFD, and everybody else uses NFC, but either way it's
still normalization.

> So I would not be surprised if the globbing libraries, for example, will
> do NFD-mangling in order to glob "correctly", so even programs ported from
> real Unix might end up getting pathnames subtly changed into NFD as part
> of some hot library-on-library action with UTF hackery inside.

Why would the globbing libraries have to do anything special to  
understand NFD? In fact, I prefer that they don't - it's very handy to  
be able to type Ma* and have that match Märchen, as the globbing
library sees Ma<U+0308>rchen (a plain "a" followed by a combining
diaeresis) and is happy to match the <U+0308>rchen against *.
Were the filename in NFC, I couldn't do that. Similarly, Ma<tab>
autocompletes the name Märchen for me. But the convenience is beside  
the point - what I'm trying to show here is that if the globbing  
library were NFD-aware, it probably would decide Ma* shouldn't match  
Märchen, right?
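
The Ma* behaviour can be sketched in Python, with the standard fnmatch
module standing in for a shell's globbing library (an assumption: it is
not the library OS X ships, but like a normalization-unaware globber it
matches code point by code point). The function normalizing_match is a
hypothetical "normalization-aware" matcher, written here with NFC only
to show one way such a library could end up rejecting the match.

```python
import fnmatch
import unicodedata

nfd_name = "Ma\u0308rchen"   # decomposed, as HFS+ hands it back
nfc_name = "M\u00e4rchen"    # precomposed
pattern = "Ma*"

# A matcher that knows nothing about normalization compares code points:
# the decomposed name really does start with 'M', 'a'.
print(fnmatch.fnmatchcase(nfd_name, pattern))   # True
# The precomposed name starts with 'M', U+00E4, so "Ma*" does not match.
print(fnmatch.fnmatchcase(nfc_name, pattern))   # False

def normalizing_match(name: str, pat: str, form: str = "NFC") -> bool:
    """Hypothetical normalization-aware glob: normalize both sides first."""
    return fnmatch.fnmatchcase(
        unicodedata.normalize(form, name),
        unicodedata.normalize(form, pat),
    )

# Once the name is composed before matching, the decomposed file no
# longer matches "Ma*" either.
print(normalizing_match(nfd_name, pattern))     # False
```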

I assume globbing libraries et al don't do UTF-8 hackery in Linux,  
right? And yet using NFC-encoded filenames is fairly common? So why  
should it be any different on OS X, especially since HFS+ isn't the  
only option here (and thus doing NFD conversion in the library would  
mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the  
HFS+ level is because they simply couldn't trust user-level libraries  
to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were  
not the responsibility of the file system -- but I realize that we  
cannot just ignore the problem and let the other layers sort it all  
out."

> Things like the finder etc, which must be very aware of the fact that
> filenames get corrupted, would presumably internally always convert
> everything they get into NFD in order to compare names from different
> sources. And as part of that, programs may well corrupt the name before
> they then use it to create a pathname.

I don't get why you're still calling it corruption when, on an HFS+  
system, NFD-encoding is correct. It would be corruption for HFS+ to  
write anything else but NFD.

> The fact that your perl program works under NFS, but creates NFD on a VFAT
> volume, does imply that they probably used at least some of the same
> routines they use in HFS+ for VFAT. Not entirely surprising: doing case
> insensitive stuff with Unicode is nasty code, so why not share it (even if
> it's then incorrect for FAT).
>
> Piece of crap it is, though. Apple has painted themselves into a nasty
> corner there.

There's no reason to assume that OS X is actually storing the NFD on  
the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in, it's not just HFS
that's involved here.  In Mac OS X, SMB, MSDOS, UDF, ISO 9660
(Joliet), NTFS and ZFS file systems all store in one form -- NFC.  We
store in NFC since that's what is expected for these file systems.  If
we were to allow NFD to pass through, it would cause problems when
these names were accessed outside of Mac OS X.  So this is not just an
HFS issue but an interchange issue for Mac OS X.  We have the legacy
NFD use/expectation in our applications and we chose not to ignore the
problem but make a conscious effort to have the appropriate form used
(NFD in Mac OS X APIs, NFC elsewhere).  It's not perfect but neither is
the agnostic approach where both forms can be used and you can have
duplicate filenames in your file system."
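
A rough Python sketch of the split described in that quote: NFD toward
applications, NFC on the interchange volume. The function names and the
sample listing are hypothetical, and unicodedata's standard NFD/NFC are
only stand-ins here; they may not match byte for byte what the OS
actually produces. The last part shows the duplicate-name case the
quote attributes to the agnostic approach.

```python
import unicodedata

def to_api_form(stored_name: str) -> str:
    """Name as handed to applications: decomposed (NFD). Hypothetical helper."""
    return unicodedata.normalize("NFD", stored_name)

def to_storage_form(api_name: str) -> str:
    """Name as written to an interchange volume: composed (NFC). Hypothetical helper."""
    return unicodedata.normalize("NFC", api_name)

# Converting at the boundary round-trips cleanly for this name.
on_disk = "M\u00e4rchen"                 # NFC, as other systems expect
seen_by_app = to_api_form(on_disk)       # 'a' + U+0308 inside the application
print(seen_by_app == "Ma\u0308rchen")                 # True
print(to_storage_form(seen_by_app) == on_disk)        # True

# Agnostic approach: store whatever bytes were passed in.
# Made-up listing containing both spellings of the same visible name.
raw_listing = ["M\u00e4rchen", "Ma\u0308rchen"]
agnostic = set(raw_listing)
normalized = {to_storage_form(n) for n in raw_listing}
print(len(agnostic), len(normalized))                 # 2 1
```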

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
