Re: git on MacOSX and files with decomposed utf-8 file names

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 18 Jan 2008 12:16:58 +0100

Peter Karlsson wrote:
> Linus Torvalds wrote:
> 
> > The difference I see between us is that when I tell you that this is
> > exactly the same thing as your file *contents*,
> 
> This is the same issue as the CRLF issue I posted on earlier, and it
> all stems from that git also sees file names as a stream of bytes, not
> a string of characters, just as it does text.

You have to be careful about CRLF conversion, lest you corrupt your
binary files. CRLF conversion is off by default.

> > An OS that silently changes the contents of your files is *crap*.
> > Get it?
> 
> A program that silently ignores the conventions of the platform it runs
> on is *crap*, no matter if the conventions are not the same as for
> other platforms.
> 
> > An OS that silently changes the contents of your directories is *crap*.
> > Get it now?
> 
> A program that silently ignores the conventions of the file system it
> tries to store its files on is *crap* :-)

Git's philosophy of treating the contents of files and the "contents" of
directories (filenames) as streams of bytes, i.e. of using the 'native'
encoding, works perfectly well and _fast_ if all developers work in the
same environment. Trouble starts when you work across operating systems
and across filesystems.

> In my perfect world, file names would be stored as a string of characters,
> so if I save a file with an å in it, that å would be preserved no
> matter if I run Linux on ext2 with my locale set to latin-1 (which
> stores it as byte 0xE5), on Windows with NTFS (which stores it as the
> UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the
> byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose
> byte sequence I don't know off the top of my head). If that was just
> stored as U+00E5 in whatever encoding in the filename index, the local
> implementation of git can just check it out in the form needed.
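
For reference, the byte sequences mentioned above can be checked with a
short Python sketch. Python and the helper name encoded_forms are only
illustrations here, NFD is used as an approximation of the decomposed
form Mac OS X stores, and UTF-16 is shown big-endian so that it reads
as the code unit 0x00E5:

```python
import unicodedata


def encoded_forms(name):
    """Return the byte sequences for `name` under several encodings.

    Hypothetical helper for illustration; nothing like it exists in git.
    """
    return {
        "latin-1": name.encode("latin-1"),
        "utf-16 code unit (big-endian)": name.encode("utf-16-be"),
        "cp437 (DOS/FAT)": name.encode("cp437"),
        "utf-8, precomposed (NFC)": unicodedata.normalize("NFC", name).encode("utf-8"),
        # Assumption: plain NFD stands in for the decomposed form used on Mac OS X.
        "utf-8, decomposed (NFD)": unicodedata.normalize("NFD", name).encode("utf-8"),
    }


if __name__ == "__main__":
    # U+00E5, LATIN SMALL LETTER A WITH RING ABOVE
    for label, raw in encoded_forms("\u00e5").items():
        print(f"{label:32} {' '.join(f'{b:02x}' for b in raw)}")
```

The decomposed form is the letter 'a' followed by U+030A (combining ring
above), so the same visible name is three bytes there and two bytes in
precomposed UTF-8. Compared as byte streams, all five are different names.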

Git has had i18n.commitEncoding for a long time, and for some time now it
has saved it in the 'encoding' header of the commit object (if different
from 'utf-8'); it also has i18n.logOutputEncoding.

To deal with different filesystem encodings you would need both: the
encoding used for filenames in 'tree' objects (by the repository), saved
somewhere in the repository, either in the tree object itself (argh!) or
in some kind of .gitconfig file; and the encoding used by the filesystem,
saved in the repository config as i18n.filesystemEncoding or something
like that. And you would have to think about what to put in the on-disk
index and in the in-memory index.
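
A rough sketch of what such a two-encoding scheme would have to do on
checkout and on add, in Python for brevity. Everything here is
hypothetical: the function names, the parameters and the normalization
step are my assumptions, and git has no i18n.filesystemEncoding setting.

```python
import unicodedata


def tree_name_to_fs(raw, tree_encoding, fs_encoding, fs_normalization=None):
    """Checkout direction: tree-object bytes -> bytes to create on disk.

    Hypothetical: `fs_encoding` plays the role of the imagined
    i18n.filesystemEncoding; `fs_normalization` (e.g. "NFD") models a
    filesystem that stores decomposed names.
    """
    text = raw.decode(tree_encoding)
    if fs_normalization is not None:
        text = unicodedata.normalize(fs_normalization, text)
    return text.encode(fs_encoding)


def fs_name_to_tree(raw, fs_encoding, tree_encoding, tree_normalization="NFC"):
    """Add direction: bytes read from disk -> bytes to store in the tree.

    Assumption: the repository settles on one normalization form (NFC here).
    """
    text = unicodedata.normalize(tree_normalization, raw.decode(fs_encoding))
    return text.encode(tree_encoding)
```

With a UTF-8 repository, tree_name_to_fs(b"\xc3\xa5.txt", "utf-8",
"latin-1") gives b"\xe5.txt". Note that both functions raise an error
for names that are not valid in the source encoding or not representable
in the target one, which a real implementation would have to handle.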

NOTE, NOTE, NOTE! If a filename is used somewhere in the file contents
(a manifest-like file, an include-like statement), and this filename uses
characters that are encoded differently in different encodings, you are
badly screwed with this fancy system anyway.
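
To make the problem concrete, here is a small Python sketch under the
assumption that filenames get converted on checkout while file contents
are left alone. The manifest format (one name per line) and the helper
name manifest_mentions are made up for the example.

```python
def manifest_mentions(manifest, filename_on_disk):
    """Byte-wise lookup of a filename in a one-name-per-line manifest.

    Hypothetical helper; it compares raw bytes, as a simple tool would.
    """
    return filename_on_disk in manifest.split(b"\n")


# The manifest was committed on a latin-1 system; its contents are not converted.
manifest = "\u00e5.txt\n".encode("latin-1")

# The same file as named on disk on a latin-1 system and on a UTF-8 system.
name_latin1 = "\u00e5.txt".encode("latin-1")
name_utf8 = "\u00e5.txt".encode("utf-8")

if __name__ == "__main__":
    print(manifest_mentions(manifest, name_latin1))
    print(manifest_mentions(manifest, name_utf8))
```

The lookup succeeds for the latin-1 name and fails for the UTF-8 one:
the name on disk was converted, the name inside the file was not.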

-- 
Jakub Narebski
Poland