On Jan 19, 2008, at 3:48 AM, Dmitry Potapov wrote:
> On Fri, Jan 18, 2008 at 03:24:34PM -0500, Kevin Ballard wrote:
>> As far as I can tell, the only time you ever run into the problems you've described on a filesystem which treats filenames as unicode strings (and therefore is free to normalize), are when you're trying to interact with a filesystem that treats filenames as sequences of bytes.
>
> If you read the first message in this thread, you would probably know that the problem exists on Mac even without any other filesystem being involved. My understanding is that it is caused by HFS+ converting one sequence of Unicode characters (generated by the Mac keyboard driver) to another sequence using "fast decomposed" conversion.

In this case, git's index counts as a filesystem that treats filenames as sequences of bytes. But yes, it is possible, though somewhat difficult, to produce this problem on just HFS+. It's far more common when the file was originally added on a different filesystem.
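
To make the byte-sequence point concrete, here is a minimal Python sketch. It assumes names are stored as UTF-8 and compared bytewise, the way an index that treats filenames as byte sequences would compare them. The example name and the helper `same_name_bytewise` are hypothetical, chosen only for illustration; this is not git's actual code.

```python
import unicodedata

# Hypothetical example name containing a precomposed a-umlaut (U+00E4).
name = "M\u00e4rchen"

# The same visible string in composed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# What a byte-oriented store would actually see.
nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")


def same_name_bytewise(a: bytes, b: bytes) -> bool:
    """Compare two names the way a byte-sequence store would."""
    return a == b


def same_name_normalized(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here `same_name_bytewise(nfc_bytes, nfd_bytes)` is false even though both forms render identically, while `same_name_normalized(nfc, nfd)` is true. Note that this uses standard NFD, not the HFS+ variant of it.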

> [And please stop calling normalization what is not. Mac does NOT normalize Unicode strings, it uses some sub-standard conversion, which neither produces a normalized string nor is guaranteed to be stable across versions of Unicode.]

From what the HFS+ technote says, it produces a variant of Normal Form D. This variant is not guaranteed to be stable across versions of HFS+, but in practice it is stable.
What would you prefer I call it?

>> This doesn't mean treating filenames as unicode strings is wrong, it just means that the world would be much better if every filesystem had the same behaviour here. It's kinda like the endian issue, except there's no simple solution here.
>
> Actually, there is, if you care to do something. You can write a wrapper around readdir(3) that recodes filenames into Unicode Normal Form C. This does not require much knowledge of Git -- what it requires is the desire to do something to solve the problem. Of course, this step alone is not a complete solution (it does not solve the case-insensitivity issue), but it is the first step in the right direction...

I'm not sure how that would solve anything. Sure, it would provide a stable, known encoding for git to compare filenames against, but that would only work if the filename is known to be Unicode, and as has been pointed out, on other filesystems the filename can be in whatever encoding the user chooses (which, IMHO, is a flaw).
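
For reference, the proposed wrapper could look roughly like the Python sketch below. It is not an actual git patch: `to_nfc_bytes` and `listdir_nfc` are hypothetical names, and it assumes that any name which decodes as UTF-8 really is UTF-8. That assumption is exactly the weak spot described above, since a name in some other encoding either fails to decode and is passed through unchanged, or happens to decode and is recoded wrongly.

```python
import os
import unicodedata


def to_nfc_bytes(raw: bytes) -> bytes:
    """Recode one filename to NFC if it is valid UTF-8, else return it as-is."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Encoding unknown: there is nothing safe to normalize.
        return raw
    return unicodedata.normalize("NFC", text).encode("utf-8")


def listdir_nfc(path: bytes) -> list:
    """Hypothetical readdir-style wrapper returning names recoded to NFC.

    Passing a bytes path makes os.listdir return raw bytes names.
    """
    return [to_nfc_bytes(raw) for raw in os.listdir(path)]
```

For example, `to_nfc_bytes(b"Ma\xcc\x88rchen")` returns the composed form `b"M\xc3\xa4rchen"`, while the lone Latin-1 byte `b"\xe4"` comes back untouched because it is not valid UTF-8.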

> BTW, Git is far from being the only software that has run into this problem with Mac. But not being first, we can benefit from other people's experiences: http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html

It looks like their problem was binary compatibility with strings from other clients that were using Normal Form C instead of Normal Form D. git's problem is that it only ever has a known filename encoding on HFS+.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com