Re: git on MacOSX and files with decomposed utf-8 file names

Kevin Ballard <kevin@xxxxxx> · Mon, 21 Jan 2008 16:49:46 -0500

On Jan 21, 2008, at 4:33 PM, Linus Torvalds wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:

I'm not sure what you mean. I stated a fact - at least on OS X, the
filename does not contribute to the listed filesize, so changing the
encoding of the filename doesn't change the filesize. This isn't a
philosophical point, it's a factual statement.

And my point was that your *whole* argument boils down to
"normalization is invisible".

When it isn't. It's not invisible for filenames, it's not invisible
for file contents.

You're trying to claim that normalization cannot matter. I'm just
pointing out that it sure as hell can. Exactly because lots of things
don't actually look at data other than as just a Unicode string. They
do look at the raw format.

And that's true both of file contents and file names.
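
As a small illustration of the "raw format" point, here is a sketch
(the example filename is made up) showing that the composed and
decomposed spellings of one name compare equal after normalization
but are different byte sequences, which is what a byte-oriented tool
sees:

```python
import unicodedata

# Made-up example name; \u00e4 is a precomposed "a with diaeresis".
name_nfc = unicodedata.normalize("NFC", "M\u00e4rchen.txt")
name_nfd = unicodedata.normalize("NFD", name_nfc)

# The same name as raw UTF-8 bytes, the way a byte-oriented tool sees it.
bytes_nfc = name_nfc.encode("utf-8")  # 12 bytes
bytes_nfd = name_nfd.encode("utf-8")  # 13 bytes: "a" plus combining U+0308

# Equal once normalized, different in their raw form.
same_after_normalizing = unicodedata.normalize("NFC", name_nfd) == name_nfc
same_raw_bytes = bytes_nfc == bytes_nfd

print(bytes_nfc.hex(), bytes_nfd.hex())
print(same_after_normalizing, same_raw_bytes)
```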

I don't, but I do think this discussion revolves around filenames,
therefore it should not surprise you when I talk about filenames.

I'm surprised that you make generalized sweeping statements about how
it's ok to normalize because normalization is "invisible", and then
when I point out that that isn't true, you try to limit it.

And no, that normalization is not invisible EVEN IN FILENAMES. If it
was, git wouldn't ever have noticed it, would it?

I'm really surprised that, after all of this, you're still horribly  
misunderstanding my argument. I never said it was invisible. NEVER.

I'm also surprised that you seem to care more about this argument than
my offer to stop arguing and work towards fixing the problem.

And git tries to be a general data tool, not a Unicode-specific one.

Yes, I realize that. See my previous message about discussing ideal vs
practicality.

I don't know which argument you're talking about. Git (and, btw,
Linux) does the "ideal" thing (don't screw up people's data), and it
turns out to be the "practical" thing too (it can handle a wider range
of cases than OS X can).

So no, this is not "ideal" vs "practical". They aren't in any conflict
here.

You misunderstand my point. In a previous email I specifically used  
the words "ideal" and "practical" to describe arguments, which is what  
I was referring to here.

I could argue against this, but frankly, I'm really tired of arguing
this same point. I suggest we simply agree to disagree, and move on to
actually fixing the problem.

.. and people have even suggested how. Hide the idiotic OS X choices
by making an OS X-specific wrapper around readdir() that turns it into
NFC.
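
The wrapper idea can be sketched roughly as follows. This is only an
illustration in Python, not git's code: os.listdir() stands in for
readdir(), and the names nfc_names() and listdir_nfc() are made up.

```python
import os
import unicodedata

def nfc_names(names):
    """Return the given names converted to NFC (hypothetical helper)."""
    return [unicodedata.normalize("NFC", name) for name in names]

def listdir_nfc(path="."):
    """Hypothetical stand-in for a readdir() wrapper.

    Lists a directory and converts every entry to NFC, so callers never
    see the decomposed form the filesystem hands back.
    """
    return nfc_names(os.listdir(path))

# A decomposed name as the filesystem might return it...
decomposed = unicodedata.normalize("NFD", "M\u00e4rchen.txt")
# ...comes back in composed form from the helper.
composed = nfc_names([decomposed])[0]
```

Note that this converts every name to NFC, including names that were
never NFC to begin with, so it is a guess at the original form rather
than a recovery of it.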

And I've responded to that suggestion, multiple times, saying that  
this doesn't actually fix the problem, it only hides it.

That's just about the best we can do. We can't *fix* the thing that OS
X loses information, but at least we can then show the lost
information in the same form it _probably_ was in originally.

But no, it won't "fix" git on OS X.

Quite a while ago it was suggested that git uses a table that maps the  
original byte sequence as seen in the index to the form returned by  
readdir(). So far this has sounded like the best solution, but as I've  
said before I don't know git's internals enough (or, really, at all)  
to be able to work on this myself.

This solution should only "lose" information in the case where the  
index has 2 filenames that HFS+ treats as a single filename.
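
To make the table idea concrete, here is a rough sketch in Python. It
is not based on git's internals: the function names are made up, and
plain NFD is used as an assumed stand-in for the way HFS+ compares
names (the real HFS+ decomposition rules differ in detail).

```python
import unicodedata

def fs_key(name):
    # Assumption: plain NFD approximates how HFS+ treats names as equal.
    return unicodedata.normalize("NFD", name)

def build_name_table(index_names):
    """Map the filesystem's view of each name back to the index spelling.

    Returns (table, collisions). A collision is a pair of distinct index
    names that the filesystem would treat as one file.
    """
    table = {}
    collisions = []
    for name in index_names:
        key = fs_key(name)
        if key in table and table[key] != name:
            collisions.append((table[key], name))
        else:
            table[key] = name
    return table, collisions

def index_name_for(table, disk_name):
    """Translate a name returned by readdir() to the spelling in the index."""
    return table.get(fs_key(disk_name), disk_name)

# The index stores the composed form; the filesystem returns the decomposed one.
composed = "M\u00e4rchen.txt"
decomposed = unicodedata.normalize("NFD", composed)

table, collisions = build_name_table([composed, "README"])
restored = index_name_for(table, decomposed)

# Two index entries that differ only in normalization collide.
_, clash = build_name_table([composed, decomposed])
```

Here restored is the composed spelling from the index, and clash holds
the one pair of names that cannot both exist on such a filesystem,
which is the only case where the table has to give something up.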

Is there some reason this won't work?

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com
