Re: git on MacOSX and files with decomposed utf-8 file names

On Sat, Jan 19, 2008 at 09:55:45AM -0500, Kevin Ballard wrote:
> 
> >[And please stop calling normalization what is not. Mac does NOT
> >normalize Unicode strings, it uses some sub-standard conversion,
> >which neither produces a normalized string nor is guaranteed to be
> >stable across versions of Unicode.]
> 
> From what the HFS+ technote says, it produces a variant of Normal  
> Form D. 

There is no such thing in the standard as a variant of NFD. Moreover,
even if this conversion were described in the standard, it would never
be called normalization, because normalization means a conversion that
gives all equivalent strings identical binary representations. The
HFS+ conversion does not meet this criterion, so it is not normalization.
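
As an illustration of that criterion, here is a minimal Python sketch. The
sample strings and the helper name same_bytes_after are only illustrative;
unicodedata.normalize is the standard library call. It checks that two
canonically equivalent spellings of the same character end up with
identical bytes after conversion to a real normal form:

```python
import unicodedata

# Two canonically equivalent spellings of "é" (illustrative sample data).
composed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

def same_bytes_after(form, a, b):
    """True if a and b have identical UTF-8 bytes once converted to `form`."""
    return (unicodedata.normalize(form, a).encode("utf-8")
            == unicodedata.normalize(form, b).encode("utf-8"))

# Before any conversion the byte sequences differ.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False

# Both NFC and NFD map the equivalent strings to one representation.
print(same_bytes_after("NFC", composed, decomposed))  # True
print(same_bytes_after("NFD", composed, decomposed))  # True
```

A conversion that leaves even one pair of equivalent strings with different
bytes fails this check, and that is the sense in which it is not a
normalization.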

> This variant is not guaranteed to be stable across versions of HFS+,
> but in practice it is stable.
> 
> What would you prefer I call it?

Apple calls it decomposition, which is correct even if it is not full
decomposition as stated in the technote.

> 
> >>This doesn't mean treating filenames as unicode strings is wrong, it
> >>just means that the world would be much better if every filesystem
> >>had the same behaviour here. It's kinda like the endian issue, except
> >>there's no simple solution here.
> >
> >Actually, there is, if you care to do something. You can write a
> >wrapper around readdir(3) that recodes filenames into Unicode Normal
> >Form C. This does not require much knowledge of Git -- what it requires
> >is the desire to do something to solve the problem. Of course, this
> >step alone is not a complete solution (it does not solve the
> >case-insensitivity issue), but it is the first step in the right
> >direction...
> 
> I'm not sure how that would solve anything. Sure, it would provide a  
> stable, known encoding for git to compare filenames against, but that  
> would only work if the filename is known to be Unicode, and as it has  
> been pointed out on other filesystems the filename can be whatever  
> encoding the user chooses (which, IMHO, is a flaw).

I believe that Git should internally use only UTF-8 for encoding file
names, commit messages, etc. The problems with some other filesystems
should be addressed separately (by those who work on those systems or
at least have access to them). Regardless of interoperability with other
systems, this change alone should solve the issue that was described
in the first message of this thread.
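
The readdir(3) wrapper mentioned above can be sketched in a few lines. This
is only an illustration in Python, not Git code: listdir_nfc is a
hypothetical helper name, and it is built on os.listdir rather than on the
C API.

```python
import os
import unicodedata

def listdir_nfc(path="."):
    """Return the entries of `path` with every name recoded to NFC.

    Hypothetical helper: it plays the role of the readdir(3) wrapper
    discussed above, built on os.listdir instead of the C API.
    """
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]
```

Names returned this way can be compared against names stored in NFC with a
plain string comparison, whether the filesystem handed them back composed
or decomposed. As noted above, this does nothing for case-insensitivity.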

Dmitry