Re: git on MacOSX and files with decomposed utf-8 file names

On Jan 21, 2008, at 5:34 PM, Linus Torvalds wrote:

On Mon, 21 Jan 2008, Kevin Ballard wrote:

I'm really surprised that, after all of this, you're still horribly
misunderstanding my argument. I never said it was invisible. NEVER.

You said it was invisible when you treat things "as text". Here's the
quote:

	.. when you treat filenames as text, it DOESN'T MATTER if the
	string gets normalized ..

Without ever apparently realizing that "as text" is part of the problem in
itself. What is "text" to one person is gibberish to another.

Which is actually a good argument as to why filenames should be enforced as UTF-8.

In particular, the biggest reason to not normalize is that you don't know it's text or Unicode in the first place. Which is why git doesn't do it.

Sure, I understand why git doesn't do it. I'm saying that a system which uses unicode top-to-bottom, which you can create if you're using HFS+ only, can do it. On HFS+ you know the filename is unicode.

And no, even with filenames you don't know that they are "text". People
encode stuff in them. And people don't always use UTF-8.
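
A minimal Python sketch of that point, assuming a filename arrives as raw bytes the way a byte-oriented filesystem hands it back. The sample name and the helper `is_valid_utf8` are hypothetical and only for illustration; nothing here is git's actual code:

```python
# The same visible name, "cafe" with an acute accent on the e,
# encoded under two different conventions.
NAME = "caf\u00e9"
latin1_name = NAME.encode("latin-1")  # one byte for the accented letter
utf8_name = NAME.encode("utf-8")      # two bytes for the accented letter


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A tool that assumes UTF-8 accepts one spelling and chokes on the other,
# even though both are perfectly good filenames as byte sequences.
print(latin1_name, is_valid_utf8(latin1_name))
print(utf8_name, is_valid_utf8(utf8_name))
```

A tool that only ever sees bytes has no reliable way to tell which convention was intended, which is the sense in which the name may not be "text" at all.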

Again, I was talking about a system that used unicode top-to-bottom. On HFS+ you have to use UTF-8 for your filename or it simply won't work.

Of course, you could ask everybody to create OS X-only programs that know
that under OS X, you only have a subset of filenames. If so, you're
complaining about the wrong tool. Especially when the whole point of the
tool was to be distributed (not to mention coming from an environment that
simply doesn't have the same silly limitations OS X has).

So here's a few clues:

- "as text" isn't "as unicode": it may well be Latin1 or EUC-JP or
  something. Yes, it's still used. Git doesn't care, and very consciously
  has avoided forcing character sets, even if the *default* (and notice
  how it's overridable) commit message encoding may be utf-8.

- In fact, even in unicode, the difference between "identical" and
  "equivalent" strings exists, and even in the standard, unicode
  strings are very much defined to be arbitrary codepoint sequences, not
  normalized.

So even for the very specific case of unicode text, it's simply not true
that "it doesn't matter if the string gets normalized". The unicode spec
itself talks about cases where even canonical normalization makes a
difference.
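
The "identical" versus "equivalent" distinction can be sketched in Python with the standard `unicodedata` module. The sample strings are chosen for illustration only, and this is not a description of how git or HFS+ compare names:

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Identical: the same codepoint sequence (and hence the same bytes).
identical = composed == decomposed

# Equivalent: the same sequence after canonical normalization.
equivalent = (
    unicodedata.normalize("NFD", composed)
    == unicodedata.normalize("NFD", decomposed)
)

# A byte-oriented tool sees two different names.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

print(identical, equivalent, same_bytes)
```

Normalizing on the way in makes the two spellings collapse into one name; leaving the bytes alone keeps them distinct.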

Search for this quote:

 "Not all processes are required to respect canonical equivalence. For
  example:

   * A function that collects a set of the General_Category values
     present in a string will and should produce a different value for
     <angstrom sign, semicolon> than for <A, combining ring above, greek
     question mark>, even though they are canonically equivalent.
   * A function that does a binary comparison of strings will also find
     these two sequences different."

and notice that first case. Even things that are *very*much* aware of
Unicode text do actually have cases where canonical equivalence doesn't
mean crud.
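
The first case in that quote can be reproduced with Python's `unicodedata` module. This is a small sketch; the helper name `categories` is hypothetical:

```python
import unicodedata


def categories(s: str) -> set:
    """Collect the set of General_Category values present in a string."""
    return {unicodedata.category(ch) for ch in s}


# <angstrom sign, semicolon>
first = "\u212b\u003b"
# <A, combining ring above, greek question mark>
second = "\u0041\u030a\u037e"

# The two strings are canonically equivalent...
canonically_equivalent = (
    unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)
)

# ...yet a Unicode-aware function legitimately tells them apart.
print(canonically_equivalent)
print(sorted(categories(first)))
print(sorted(categories(second)))
```

The second string contains a nonspacing mark (category Mn) that the first one lacks, so the category sets differ even though normalization maps both strings to the same result.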

I find it amusing that you keep arguing against having git treat filenames as unicode. If you had actually taken my advice and read my previous email talking about "ideal" vs "practical", you'd realize that I was not suggesting git should. I was simply describing why having the filesystem specifically treat filenames as utf-8 isn't a problem when the entire system is unicode-aware. That shows that the problems cropping up in git aren't because the filesystem treats filenames as unicode, but rather because the filesystem treats filenames differently than other filesystems. In other words, I was trying to illustrate that HFS+ isn't wrong, it's just different, and the difference is causing the problem.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@xxxxxx
http://www.tildesoft.com

