Re: git on MacOSX and files with decomposed utf-8 file names

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 17 Jan 2008 11:11:27 -0800 (PST)

On Thu, 17 Jan 2008, Pedro Melo wrote:
>
> We have people using windows, people using Macs, and people using several
> flavors of Linux desktops. They all have different settings and if I add a
> file like áéióú that happens to be UTF-8 encoded, it will reach an ISO-Latin-1
> user as visual garbage.

Yes.
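
To make the "visual garbage" concrete, here is a minimal Python sketch. It assumes the filename from the quoted message is stored in precomposed form and that the receiving user's tools interpret the raw bytes as ISO-8859-1; the variable names are illustrative only.

```python
# The example name from the quoted message, written with escapes so the
# exact code points are unambiguous (precomposed form is an assumption).
name = "\u00e1\u00e9i\u00f3\u00fa"

utf8_bytes = name.encode("utf-8")         # bytes a UTF-8 system puts in the filename
as_latin1 = utf8_bytes.decode("latin-1")  # same bytes as shown to a Latin-1 user

print(utf8_bytes)   # b'\xc3\xa1\xc3\xa9i\xc3\xb3\xc3\xba'
print(as_latin1)    # each accented letter shows up as two unrelated characters
```

The bytes are untouched in both cases; only the interpretation differs, which is why git itself tracks the file without trouble.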

> git will track the file perfectly, we know that, because the sequence of 
> bytes that my system used to create the file will be the same on all 
> "sane" systems, but the file will look "funny" to some users, and we get 
> complaints from some of the less enlightened ones.

I can't really suggest anything other than trying to make everybody use 
UTF-8.

[ Not just for filenames, by the way - this is one of the reasons I think
  it is so *important* to not corrupt filenames, exactly because this is 
  in no way filename-specific at all, and filenames are generally "textual 
  data" exactly the same way a text-file is.

  But only totally insane people think that you should force-normalize 
  text-files, even though the issues are obviously all the same 
  regardless of whether it's a filename or a word in a text file. ]
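
As a rough illustration of what normalization changes, the sketch below uses Python's standard unicodedata module to compare the composed and decomposed forms of a single character. It is only an assumption that the "decomposed" form in the subject line corresponds closely to Unicode NFD; the point is merely that the two forms look the same but are different byte sequences.

```python
import unicodedata

composed = "\u00e9"                                  # "é" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)     # b'\xc3\xa9'
print(decomposed_bytes)   # b'e\xcc\x81'
print(composed_bytes == decomposed_bytes)   # False
```

A system that silently rewrites one form into the other hands back a filename whose bytes differ from the ones that were stored.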

And yes, I also realize that it's not going to be realistic. We're 
probably *closer* to that than we used to be, but I don't think you can 
even make Windows think FAT is UTF-8.

I don't know how NTFS works (I know it is Unicode-aware, and I think it 
encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1 
translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
actually uses that unambiguous translation for any filenames).
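
The 1:1 translation mentioned above can be sketched in Python as a round trip between the two encodings. This says nothing about what Windows actually does internally; it only shows that, for well-formed UTF-16, converting to UTF-8 and back loses nothing. The filename is a made-up example.

```python
# Hypothetical filename, written with escapes to pin down the code points.
name = "\u00e1\u00e9i\u00f3\u00fa.txt"

utf16 = name.encode("utf-16-le")                   # 16-bit code units, as a Unicode-aware FS might store
utf8 = utf16.decode("utf-16-le").encode("utf-8")   # the C-string view of the same name
back = utf8.decode("utf-8").encode("utf-16-le")    # and back again

print(back == utf16)   # True
```

The caveat is ill-formed input: an unpaired surrogate has no valid UTF-8 encoding, and Python's strict encoder rejects it.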

Under modern Linux and OS X, UTF-8 is basically the only way (older Linux 
distros may be set up for Latin1, but at least the newer ones seem to all 
default to a UTF-8 locale).

> The answer is that users should not create filenames with non-ascii characters
> if they want a consistent experience, right?

Oh, absolutely. That takes care of 99.9% of all source projects. Even then 
you can have problems with case insensitivity (the Linux kernel sources 
use only US-ASCII filenames, for example, but *literally* have many 
filenames that are identical if you ignore case, and that's not unheard of).
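
A project can check for this kind of collision before it bites someone on a case-insensitive filesystem. The following is a minimal sketch: case_collisions is a hypothetical helper, the paths are invented, and str.casefold() is only an approximation of how any particular filesystem compares names.

```python
from collections import defaultdict

def case_collisions(paths):
    """Group paths that become identical once case is ignored."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    # Keep only the groups where two or more distinct paths collide.
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Invented example paths, not taken from any real tree.
paths = ["lib/Foo.c", "lib/foo.c", "README", "Makefile"]
collisions = case_collisions(paths)
print(collisions)   # [['lib/Foo.c', 'lib/foo.c']]
```

In practice the list of paths could come from something like the output of "git ls-files".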

So yes, to a first approximation, the answer is to simply avoid using 
anything but US-ASCII. It's seldom a big limitation when talking about 
filenames.

			Linus