Re: git on MacOSX and files with decomposed utf-8 file names

On Sat, 19 Jan 2008, Dmitry Potapov wrote:
> 
> Actually, there is, if you care to do something. You can write a wrapper
> around readdir(3) that recodes filenames into Unicode Normalization Form C.

If somebody wants to do this, then readdir() isn't the only place, but 
yes, readdir() is one of the places.

I suspect that if we were to just do the "turn into NFC on readdir() on OS 
X", that might actually be good enough to hide most of the problems. The 
issue isn't just that OS X mangles the filenames, it's that it picks a 
particularly *stupid* way to mangle them (the decomposed forms), which 
means that OS X will actually not just corrupt "odd cases" of Unicode, but 
will corrupt the obvious and *common* Latin1 translations of Unicode.

I don't know if NFD is better for other locales, but I doubt it. Usually 
people want the *composed* forms, not the *de*composed forms.

A trivial example of this for some cross-OS issue:

 - let's say that you have a file "Märchen" on just about *any* other OS 
   than OS X. It could be Latin1 or it could be Unicode, but even if it is 
   Unicode, I can almost guarantee that the 'ä' is going to be the 
   *single* Unicode character U+00e4 (utf-8: "\xc3\xa4", latin1: "\xe4")

   So from a cross-OS standpoint, that's the *common* representation, and 
   yes, you can create the file that way (I don't know what happens if you 
   actually create it with the Latin1 encoding, but I would not be 
   surprised if OS X notices that it's not a valid UTF-8 sequence and 
   assumes it's Latin1 and converts it to Unicode)

 - But on OS X, because of Apple's *insane* choice of normal form, it will 
   then be turned into "a¨": a plain 'a' followed by the combining 
   diaeresis U+0308 (utf-8: "a\xcc\x88"). I doubt *anybody* else does 
   that. If you have to normalize it, NFD is just about the *worst* choice.
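
The two representations in the example above can be checked with a few lines of Python. This is only an illustration using the standard `unicodedata` module; the variable names are made up for the example and nothing here is git code.

```python
import unicodedata

# Composed form (NFC): 'ä' is the single code point U+00E4.
name_nfc = "M\u00e4rchen"

# Decomposed form (NFD): 'a' followed by the combining diaeresis U+0308.
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(name_nfc.encode("utf-8"))    # b'M\xc3\xa4rchen'
print(name_nfd.encode("utf-8"))    # b'Ma\xcc\x88rchen'
print(name_nfc.encode("latin-1"))  # b'M\xe4rchen'

# The two strings look identical when rendered, but compare unequal
# and have different lengths, which is what trips up a byte-wise
# filename comparison.
print(name_nfc == name_nfd)          # False
print(len(name_nfc), len(name_nfd))  # 7 8
```

Normalizing the decomposed name back with `unicodedata.normalize("NFC", name_nfd)` gives the composed string again.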

So yeah, even just re-coding it as NFC on readdir() would at least mean 
that any OS X git client would be MORE LIKELY to pick the same 
representation as git clients on other OS's.

It wouldn't solve all problems (and it would almost certainly create a few 
new ones), but it would likely at least increase compatibility between 
systems.

So doing the NFC conversion on readdir() on OS X is probably a good idea, 
and probably is the simplest way to make it interact better with other 
OS's. And it's definitely safe on OS X, since OS X _already_ corrupted the 
name, so we're not losing any information (in contrast, on other systems, 
doing an NFC conversion would possibly lose encoding detail _and_ might be 
incorrect simply because they might not use Unicode in the first place).
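
As a rough sketch of the idea (in Python for brevity, not the C that git would need), a directory-listing wrapper could normalize names to NFC only when running on OS X. The helper names `to_nfc` and `listdir_nfc` and the `normalize` parameter are hypothetical, and the `sys.platform == "darwin"` check is an assumption about how the platform would be detected.

```python
import os
import sys
import unicodedata


def to_nfc(names):
    """Return the given file names recoded to Unicode NFC."""
    return [unicodedata.normalize("NFC", name) for name in names]


def listdir_nfc(path=".", normalize=(sys.platform == "darwin")):
    """Hypothetical wrapper around os.listdir().

    On OS X the names come back decomposed, so recoding them to NFC
    loses nothing. Elsewhere the names are returned untouched, since
    normalizing there could change names that were stored
    deliberately in another form.
    """
    names = os.listdir(path)
    return to_nfc(names) if normalize else names


# The decomposed name from the example becomes the composed one.
print(to_nfc(["Ma\u0308rchen"]) == ["M\u00e4rchen"])  # True
```

As noted above, readdir() is only one of the places that would need this, so a wrapper like this would cover directory listings and nothing else.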

Anybody want to create a compat layer around "readdir()" that does that NFC 
conversion on OS X but not elsewhere?

		Linus