Re: git on MacOSX and files with decomposed utf-8 file names


 



On Jan 23, 2008, at 11:16 AM, Linus Torvalds wrote:

On Wed, 23 Jan 2008, Theodore Tso wrote:

So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
doing no normalization, as it is creating two files.  On HFS+, MacOS
is mapping both filenames to the same decomposed name.
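A rough way to repeat this experiment, sketched in Python rather than the perl used in the thread. The helper name create_both_forms and the sample name "Märchen" are illustrative assumptions, not taken from the original test. It creates the same name in NFC and NFD inside a directory and reports what the filesystem kept; the result depends entirely on the filesystem underneath, so no particular output is claimed.

```python
import os
import tempfile
import unicodedata


def create_both_forms(directory):
    """Create one file per normalization form and return the directory listing."""
    nfc = unicodedata.normalize("NFC", "M\u00e4rchen")  # precomposed U+00E4
    nfd = unicodedata.normalize("NFD", nfc)             # 'a' + combining U+0308
    for name in (nfc, nfd):
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("x")
    return os.listdir(directory)


with tempfile.TemporaryDirectory() as scratch:
    names = create_both_forms(scratch)
    # Two entries: the filesystem stored the names as given.
    # One entry: the names were mapped to a single form along the way.
    print(len(names), [ascii(n) for n in names])
```

To test a specific volume, pass a directory on that volume instead of the temporary one.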

Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle
filenames on non-HFS+ filesystems.

The problem is that since most native applications *expect* that name
mangling, they'll probably do name mangling of their own (internally) just
to compare the names!

Well yes, any context in which a string is treated as Unicode instead of an opaque sequence of bytes will probably lead to normalization at some point (e.g. when searching text, I'm going to want Märchen and Märchen to be treated as the same string). The Mac OS X APIs use NFD, and everybody else uses NFC, but either way it's still normalization.
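To make the two forms concrete, here is a small sketch using Python's standard unicodedata module. The helper same_text is a hypothetical name introduced only for this illustration; it shows that the two spellings differ as code points and as bytes, yet compare equal once both are normalized to the same form.

```python
import unicodedata

nfc = "M\u00e4rchen"   # precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
nfd = "Ma\u0308rchen"  # decomposed: 'a' followed by U+0308 COMBINING DIAERESIS


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(nfc == nfd)                                          # False
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))  # 8 9
print(same_text(nfc, nfd))                                 # True
print(unicodedata.normalize("NFD", nfc) == nfd)            # True
```

Which form is used for the comparison does not matter, as long as both sides get the same one.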

So I would not be surprised if the globbing libraries, for example, will do NFD-mangling in order to glob "correctly", so even programs ported from real Unix might end up getting pathnames subtly changed into NFD as part
of some hot library-on-library action with UTF hackery inside.

Why would the globbing libraries have to do anything special to understand NFD? In fact, I prefer that they don't - it's very handy to be able to type Ma* and have that match Märchen, as the globbing library sees Ma<U+0308>rchen and is happy to match the <U+0308>rchen against *. Were the filename in NFC, I couldn't do that. Similarly, Ma<tab> autocompletes the name Märchen for me. But the convenience is beside the point - what I'm trying to show here is that if the globbing library were NFD-aware, it probably would decide Ma* shouldn't match Märchen, right?
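The matching behaviour described here can be sketched with Python's fnmatch module, used purely as a stand-in for a normalization-unaware globber. It is an assumption that it behaves like the shell's glob code on OS X in this respect; fnmatch works per code point and knows nothing about normalization.

```python
import fnmatch
import unicodedata

nfd_name = "Ma\u0308rchen"                         # as HFS+ would hand it back
nfc_name = unicodedata.normalize("NFC", nfd_name)  # "M\u00e4rchen"

# The decomposed name literally starts with "Ma", so the pattern matches.
print(fnmatch.fnmatchcase(nfd_name, "Ma*"))      # True
# The precomposed name has U+00E4 as its second character, not 'a'.
print(fnmatch.fnmatchcase(nfc_name, "Ma*"))      # False

# The flip side: '?' matches exactly one code point.
print(fnmatch.fnmatchcase(nfc_name, "M?rchen"))  # True
print(fnmatch.fnmatchcase(nfd_name, "M?rchen"))  # False
```

So the Ma* convenience falls out of treating the name as an opaque sequence; no NFD-specific code is involved.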

I assume globbing libraries et al don't do UTF-8 hackery in Linux, right? And yet using NFC-encoded filenames is fairly common? So why should it be any different on OS X, especially since HFS+ isn't the only option here (and thus doing NFD conversion in the library would mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the HFS+ level is because they simply couldn't trust user-level libraries to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were not the responsibility of the file system -- but I realize that we cannot just ignore the problem and let the other layers sort it all out."

Things like the finder etc, which must be very aware of the fact that
filenames get corrupted, would presumably internally always convert
everything they get into NFD in order to compare names from different
sources. And as part of that, programs may well corrupt the name before
they then use it to create a pathname.

I don't get why you're still calling it corruption when, on an HFS+ system, NFD-encoding is correct. It would be corruption for HFS+ to write anything else but NFD.

The fact that your perl program works under NFS, but creates NFD on a VFAT
volume, does imply that they probably used at least some of the same
routines they use in HFS+ for VFAT. Not entirely surprising: doing case insensitive stuff with Unicode is nasty code, so why not share it (even if
it's then incorrect for FAT)..

Piece of crap it is, though. Apple has painted themselves into a nasty
corner there.

There's no reason to assume that OS X is actually storing the NFD on the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in, it's not just HFS that's involved here. In Mac OS X, SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file systems all store in one form -- NFC. We store in NFC since that's what is expected for these file systems. If we were to allow NFD to pass through, it would cause problems when these names were accessed outside of Mac OS X. So this is not just an HFS issue but an interchange issue for Mac OS X. We have the legacy NFD use/expectation in our applications and we chose not to ignore the problem but make a conscious effort to have the appropriate form used (NFD in Mac OS X APIs, NFC elsewhere). It's not perfect but neither is the agnostic approach where both forms can be used and you can have duplicate filenames in your file system."

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com



