Re: git on MacOSX and files with decomposed utf-8 file names

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 21 Jan 2008 14:34:32 -0800 (PST)

On Mon, 21 Jan 2008, Kevin Ballard wrote:
> 
> I'm really surprised that, after all of this, you're still horribly
> misunderstanding my argument. I never said it was invisible. NEVER.

You said it was invisible when you treat things "as text". Here's the 
quote:

	.. when you treat filenames as text, it DOESN'T MATTER if the 
	string gets normalized ..

Without ever apparently realizing that "as text" is part of the problem in 
itself. What is "text" to one person is gibberish to another.

In particular, the biggest reason to not normalize is that you don't know 
whether it's text or Unicode in the first place. Which is why git doesn't 
do it.

And no, even with filenames you don't know that they are "text". People 
encode stuff in them. And people don't always use UTF-8. 
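
To make that concrete, here is a minimal sketch in Python. The filename 
and the try_normalize() helper are made up for illustration; this is not 
anything git does. It only shows that a name stored as Latin1 bytes is not 
valid UTF-8 at all, so there is nothing meaningful to normalize:

```python
import unicodedata

# Hypothetical filename, encoded two different ways by two different users.
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'


def try_normalize(raw: bytes):
    """Return NFC-normalized UTF-8 bytes, or None if raw is not valid UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not UTF-8, so Unicode normalization does not apply.
        return None
    return unicodedata.normalize("NFC", text).encode("utf-8")


print(try_normalize(latin1_name))  # None: cannot even be decoded
print(try_normalize(utf8_name))    # b'caf\xc3\xa9.txt'
```

A tool that just treats the name as a byte string handles both cases 
without having to guess.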

Of course, you could ask everybody to create OS X-only programs that know 
that under OS X, you only have a subset of filenames. If so, you're 
complaining about the wrong tool. Especially when the whole point of the 
tool was to be distributed (not to mention coming from an environment that 
simply doesn't have the same silly limitations OS X has).

So here's a few clues:

 - "as text" isn't "as unicode": it may well be Latin1 or EUC-JP or
   something. Yes, it's still used. Git doesn't care, and very consciously 
   has avoided forcing character sets, even if the *default* (and notice 
   how it's overridable) commit message encoding may be utf-8.

 - In fact, even in unicode, the difference between "identical" and 
   "equivalent" strings exists, and even in the standard, unicode 
   strings are very much defined to be arbitrary codepoint sequences, not 
   normalized.

So even for the very specific case of unicode text, it's simply not true 
that "it doesn't matter if the string gets normalized". The unicode spec 
itself talks about cases where even canonical normalization makes a 
difference.
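
The "identical" versus "equivalent" distinction can be sketched in Python 
with the standard unicodedata module. The two strings below are just an 
illustrative pair (a precomposed e-acute and its decomposed form), not 
something taken from this thread:

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# As codepoint sequences (and as bytes) the two strings are different.
identical = composed == decomposed
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# They only compare equal once both are put through the same normalization.
equivalent = (unicodedata.normalize("NFC", composed)
              == unicodedata.normalize("NFC", decomposed))

print(identical, same_bytes, equivalent)  # False False True
print(composed.encode("utf-8"))           # b'\xc3\xa9'
print(decomposed.encode("utf-8"))         # b'e\xcc\x81'
```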

Search for this quote:

  "Not all processes are required to respect canonical equivalence. For 
   example:

    * A function that collects a set of the General_Category values 
      present in a string will and should produce a different value for 
      <angstrom sign, semicolon> than for <A, combining ring above, greek 
      question mark>, even though they are canonically equivalent.
    * A function that does a binary comparison of strings will also find 
      these two sequences different."

and notice that first case. Even things that are *very*much* aware of 
Unicode text do actually have cases where canonical equivalence doesn't 
mean crud.
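
Both cases from the quote can be checked with a short Python sketch using 
the standard unicodedata module. The categories() helper is a made-up name 
for illustration:

```python
import unicodedata

# The two sequences from the quoted passage.
seq_a = "\u212b\u003b"        # <angstrom sign, semicolon>
seq_b = "\u0041\u030a\u037e"  # <A, combining ring above, greek question mark>


def categories(s: str):
    """Collect the sorted set of General_Category values present in s."""
    return sorted({unicodedata.category(ch) for ch in s})


# Canonically equivalent: both normalize to the same string.
canon_equal = (unicodedata.normalize("NFC", seq_a)
               == unicodedata.normalize("NFC", seq_b))

print(canon_equal)        # True
print(categories(seq_a))  # ['Lu', 'Po']
print(categories(seq_b))  # ['Lu', 'Mn', 'Po']
print(seq_a == seq_b)     # False: binary comparison sees them as different
```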

		Linus