Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 19, 2008, at 3:48 AM, Dmitry Potapov wrote:

On Fri, Jan 18, 2008 at 03:24:34PM -0500, Kevin Ballard wrote:
As far as I can tell, the only time you ever run into the problems
you've described on a filesystem which treats filenames as unicode
strings (and therefore is free to normalize), are when you're trying
to interact with a filesystem that treats filenames as sequences of
bytes.

If you read the first message in this thread, you would probably know
that the problem exists on Mac even without any other filesystem being
involved. My understanding is that it is caused by HFS+ converting
one sequence of Unicode characters (generated by the Mac keyboard driver)
to another sequence using "fast decomposed" conversion.

In this case, git's index counts as a filesystem that treats filenames as sequences of bytes. But yes, it is possible, though somewhat difficult, to produce this problem on HFS+ alone. It's far more common when the file was originally added on a different filesystem.
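
To illustrate the byte-sequence point, here is a minimal Python sketch. It is not from the thread, and the filename is a made-up example. It shows that the composed and decomposed spellings of one name are different byte strings, which is all that a byte-oriented store such as the index can see.

```python
import unicodedata

# Hypothetical filename, chosen only for illustration.
name_nfc = unicodedata.normalize("NFC", "M\u00e4rchen.txt")  # precomposed "ä"
name_nfd = unicodedata.normalize("NFD", name_nfc)            # "a" + U+0308

bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")

print(bytes_nfc)               # b'M\xc3\xa4rchen.txt'
print(bytes_nfd)               # b'Ma\xcc\x88rchen.txt'
print(bytes_nfc == bytes_nfd)  # False: a byte-wise comparison sees two names
```

Note that this uses standard NFD; the HFS+ variant discussed in this thread is not identical to it, so this only approximates what the filesystem returns.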

[And please stop calling normalization what is not. Mac does NOT
normalize Unicode strings, it uses some sub-standard conversion,
which neither produces a normalized string nor is guaranteed to be
stable across versions of Unicode.]

From what the HFS+ technote says, it produces a variant of Normal Form D. This variant is not guaranteed to be stable across versions of HFS+, but in practice it is stable.

What would you prefer I call it?

This doesn't mean treating filenames as unicode strings is wrong, it
just means that the world would be much better if every filesystem had
the same behaviour here. It's kinda like the endian issue, except
there's no simple solution here.

Actually, there is, if you care to do something. You can write a wrapper
around readdir(3) that recodes filenames into Unicode Normal Form C.
This does not require much knowledge of Git -- what it requires is the
desire to do something to solve the problem. Of course, this step alone
is not a complete solution (it does not solve the case-insensitivity
issue), but it is a first step in the right direction...
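
A rough sketch of the wrapper idea, in Python rather than C and purely as an illustration: `listdir_nfc` and `demo` are hypothetical names, `os.listdir` stands in for readdir(3), and nothing here reflects how git itself is or would be implemented.

```python
import os
import tempfile
import unicodedata


def listdir_nfc(path="."):
    """Return directory entry names recomposed to Normal Form C.

    Hypothetical stand-in for a readdir(3) wrapper.
    """
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]


def demo():
    """Create a file under a decomposed name and list it back through the wrapper."""
    with tempfile.TemporaryDirectory() as tmp:
        decomposed = unicodedata.normalize("NFD", "caf\u00e9.txt")
        open(os.path.join(tmp, decomposed), "w").close()
        return listdir_nfc(tmp)


if __name__ == "__main__":
    print(demo())
```

Whatever form the filesystem hands back, the caller sees the composed spelling. This assumes the names are Unicode to begin with.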

I'm not sure how that would solve anything. Sure, it would provide a stable, known encoding for git to compare filenames against, but that would only work if the filename is known to be Unicode, and, as has been pointed out, on other filesystems the filename can be in whatever encoding the user chooses (which, IMHO, is a flaw).
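
To make the encoding concern concrete, here is a small Python sketch of the best a wrapper could do on a byte-oriented filesystem: recompose names that happen to decode as UTF-8 and pass everything else through untouched. The function name `normalize_if_utf8` is hypothetical, and the check is only a heuristic, since a name in some other encoding can occasionally be valid UTF-8 by accident.

```python
import unicodedata


def normalize_if_utf8(raw: bytes) -> bytes:
    """Recompose a filename to NFC only if its bytes are valid UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Unknown encoding (for example Latin-1): leave the bytes alone.
        return raw
    return unicodedata.normalize("NFC", text).encode("utf-8")


print(normalize_if_utf8(b"cafe\xcc\x81"))  # b'caf\xc3\xa9' (decomposed UTF-8, recomposed)
print(normalize_if_utf8(b"caf\xe9"))       # b'caf\xe9' (Latin-1 bytes, unchanged)
```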

BTW, Git is far from being the only software that ran into this problem with Mac. But not being first, we can benefit from other people's experiences:
http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html

It looks like their problem was binary compatibility with strings from other clients that were using Normal Form C instead of Normal Form D. git's problem is that HFS+ is the only place where it is even using a known encoding.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com

