Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 2:57 PM, Theodore Tso wrote:

> On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
>> You're right, but it doesn't have to treat it as a binary stream at
>> the level I care about. I mean, no matter what you do, at some level
>> the string is evaluated as a binary stream. For our purposes, just
>> redefine the hashing algorithm to hash all equivalent strings the
>> same, and you can implement that by using SHA1 on a particular
>> encoding of the string.
>
> That's horribly broken, for a couple of reasons.  First of all,
> changing the hash algorithm breaks compatibility with existing
> repositories; sure, you can try to guess what will be least likely to
> break existing repositories (which won't be the native MacOSX
> normalization algorithm, since the combined character is more likely
> to be used in other environments), but there's still no guarantee
> there aren't filenames that use some other form of byte-string for
> the filename.
>
> Secondly, the hash algorithm would not be stable.  Unicode is not
> static, and new characters can get added that may be composable, and
> thus would be normalized differently.  This is one of the reasons why
> Unicode is so horribly broken as a standard.  It was originally
> created by representatives from the printing world that were horribly
> clueless about what was needed with respect to canonicalization
> representation, so they compromised and allowed both forms, not
> realizing what a massive f*ckup this would cause later on.  So people
> have over the years piled kludges on top of kludges in order to make
> Unicode "work".
>
> So we can't blame all of the craziness on the MacOS designers,
> although they seem to have been very creative about how to take a
> bad situation and make it worse.

You seem to be under the impression that I'm advocating that git treat all filenames as Unicode strings, and thus change its hashing algorithm as described. I am not. I am saying that, if git only had to deal with HFS+, then it could treat all filenames as strings and hash equivalent names the same. However, since git does not only have to deal with HFS+, this will not work. What I am describing is an ideal, not a practicality.

In other words, what I'm saying is that treating filenames as strings works perfectly fine, *provided you can do that 100% of the time*. git cannot do that 100% of the time, therefore it's not appropriate here. The purpose of this argument is to illustrate that treating filenames as strings isn't wrong, it's simply incompatible with treating filenames as byte sequences.
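
To make the distinction concrete, here is a minimal Python sketch of the "hash all equivalent strings the same" idea. It is an illustration only: the helper name_hash is hypothetical, the choice of NFC is an assumption, and none of this is something git actually does. (HFS+ itself stores names in a decomposed form that is close to, but not identical to, NFD.)

```python
import hashlib
import unicodedata


def name_hash(filename: str, form: str = "NFC") -> str:
    """Hypothetical helper: SHA-1 over one fixed normalization of the name.

    Canonically equivalent spellings of a filename normalize to the same
    code points, so they produce the same digest.
    """
    normalized = unicodedata.normalize(form, filename)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


composed = "caf\u00e9"      # 'e with acute' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Byte-sequence view: the two spellings are different filenames.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False

# String view: the two spellings are the same filename.
print(name_hash(composed) == name_hash(decomposed))  # True
```

The first comparison is what a byte-sequence view of filenames sees; the second is what a string view sees. Either is consistent on its own, but a repository can only pick one of them.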

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com

