On Jan 21, 2008, at 3:56 PM, Dmitry Potapov wrote:
> On Mon, Jan 21, 2008 at 02:05:51PM -0500, Kevin Ballard wrote:
>>> But that is *entirely* a separate issue from "normalization". Kevin,
>>> you seem to think that normalization is somehow forced on you by the
>>> "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
>>> Normalization is a totally separate decision, and it's a STUPID one,
>>> because it breaks so many of the _nice_ properties of using UTF-8.
>>
>> I'm not saying it's forced on you, I'm saying when you treat
>> filenames as text,
>
> To treat as text could mean different things to different people. Some
> may prefer "fi" and the fi ligature to be treated as the same in some
> contexts.
Those people can use NFKC/NFKD (compatibility equivalence). As I've said before, I'm talking about canonical equivalence, because that doesn't lose information the way compatibility equivalence does (e.g. the fi ligature gets turned into "fi" under compatibility equivalence, but not under canonical equivalence).
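To make the distinction concrete, here is a small sketch using Python's standard unicodedata module: the canonical forms (NFC/NFD) leave the fi ligature alone, while the compatibility forms (NFKC/NFKD) fold it into the two letters "fi".

```python
import unicodedata

ligature = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

# Canonical normalization preserves the ligature as-is...
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFD", ligature) == ligature

# ...but compatibility normalization rewrites it to plain "fi",
# discarding the information that a ligature was ever there.
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFKD", ligature) == "fi"

print("ok")
```

That lossiness is exactly why compatibility equivalence is the wrong tool for filenames, and why the discussion here is about canonical equivalence only.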
>> it DOESN'T MATTER if the string gets normalized. As long as the
>> string remains equivalent,
>
> As a matter of fact it does; otherwise the characters would be the same
> and we would not be having this conversation at all. A string can be
> equivalent and not equivalent at the same time, because there are
> different equivalence relations. Finally, what HFS+ does is not even
> normalization. In the technote, Apple explains that they decompose some
> characters but not others for better compatibility. So, you see, there
> is a PROBLEM here.
Again, I've specified many times that I'm talking about canonical equivalence.
And yes, HFS+ does do normalization; it just doesn't use stock NFD, it uses a custom variant of it. I fail to see how this is a problem.
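Python doesn't expose the HFS+ variant itself, but plain NFD is close enough to illustrate what decomposition does to a name like "café": the precomposed and decomposed spellings are different code point (and byte) sequences, yet canonically equivalent.

```python
import unicodedata

precomposed = "caf\u00e9"   # "café" with precomposed U+00E9
decomposed = "cafe\u0301"   # "café" as "e" + U+0301 COMBINING ACUTE ACCENT

# Different code point sequences, and different UTF-8 bytes...
assert precomposed != decomposed
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# ...but canonically equivalent: NFD maps both to the same string,
# which is (roughly) the form an HFS+-style filesystem stores on disk.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFD", decomposed) == decomposed

print("ok")
```

The real HFS+ tables differ from stock NFD in a few ranges (that's the "custom variant" above), but the canonical-equivalence behavior is the same idea.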
>> Alright, fine. I'm not saying HFS+ is right in storing the normalized
>> version, but I do believe the authors of HFS+ must have had a reason
>> to do that,
>
> I don't say they did that without *any* reason. I suppose all the Apple
> developers in the Copland project had some reasons for what they did,
> but the outcome was not very good...
Stupid engineers don't get to work on developing new filesystems. And Copland didn't fail because of stupid engineers anyway. If I had to blame someone, I'd blame management.
>> The only information you lose when doing canonical normalization is
>> what the original byte sequence was.
>
> Not true. You lose the original sequence of *characters*.
Which is only a problem if you care about the byte sequence, which is kinda the whole point of my argument.
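The disagreement above can be shown directly: normalization is a many-to-one mapping, so once a name is normalized, both the original byte sequence and the original code point sequence are unrecoverable.

```python
import unicodedata

a = "\u00e9"     # é precomposed: UTF-8 bytes c3 a9
b = "e\u0301"    # e + combining acute: UTF-8 bytes 65 cc 81

# Two distinct character sequences, two distinct byte sequences...
assert a != b
assert a.encode("utf-8") != b.encode("utf-8")

# ...collapse to a single normalized form, so after normalization you
# cannot tell which of the two spellings you started with.
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print("ok")
```

Whether that loss matters is precisely the question: if you only compare names up to canonical equivalence, nothing observable is lost; if you care about the exact bytes, everything is.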
-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com