Re: [RFC 1/8] UTF helpers


Robin Rosenberg wrote on 13.5.2009 8:24:
If the conclusion is that this is a way forward, then I
could start working on a completely new set of much cleaner patches.

That would be great!

I see that in those early patches you took the approach of converting the filenames from the local encoding to UTF-8 at the outer edges of Git. That was obviously the easiest way to do it while keeping the changes to Git minimal.

I've been thinking about a somewhat more extensive approach, which should serve the interests of all stakeholders:


Currently the tree object contains the following information for each file: filename, mode, sha1. To that would be added one more string: the filename encoding. If the encoding is not specified (as in old commits made before the encoding information was added), the default encoding is "binary", which matches how Git works now (it treats filenames as a series of bytes, ignoring their encoding completely).

When a file is added/committed, the following things will happen:

1. Git determines the filename encoding used by the system. Git will try to detect it automatically from the environment, and the autodetected value can be overridden by setting the config variable "i18n.localFilenameEncoding". If autodetection fails, it will default to "binary".

2. Git reads the config variable "i18n.commitFilenameEncoding". If localFilenameEncoding equals commitFilenameEncoding, or if either of them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git saves the filename together with the local filename encoding. The bytes of the filename are not changed when it is stored in the repository (the same as now).

3B. Git converts the filename from localFilenameEncoding to commitFilenameEncoding. (The commitFilenameEncoding may also specify a normalized form for UTF-8, for example "UTF-8 NFC". This is needed for Mac OS X.) Then Git saves the filename together with the commit filename encoding.
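The add/commit steps above could be sketched roughly as follows. This is only an illustration of the decision logic, not existing Git code: the function names are made up, an unset "i18n.commitFilenameEncoding" is assumed to behave like "binary", encoding names are compared with a plain strcmp(), and autodetection is assumed to go through POSIX nl_langinfo(CODESET).

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: an unset (NULL) encoding is treated like "binary". */
static bool is_binary(const char *enc)
{
	return enc == NULL || strcmp(enc, "binary") == 0;
}

/*
 * Step 1 (sketch): a configured i18n.localFilenameEncoding wins; otherwise
 * ask the locale; if that yields nothing, fall back to "binary".
 * The pointer returned by nl_langinfo() may be overwritten by later calls,
 * so real code would copy it.
 */
static const char *detect_local_encoding(const char *config_override)
{
	const char *codeset;

	if (config_override && *config_override)
		return config_override;
	setlocale(LC_CTYPE, "");
	codeset = nl_langinfo(CODESET);
	if (codeset && *codeset)
		return codeset;
	return "binary";
}

/* Step 2 (sketch): false means step 3A, true means step 3B. */
static bool needs_conversion(const char *from_enc, const char *to_enc)
{
	if (is_binary(from_enc) || is_binary(to_enc))
		return false;
	return strcmp(from_enc, to_enc) != 0;
}

/* The encoding name that would be recorded next to the filename. */
static const char *stored_encoding(const char *local_enc,
				   const char *commit_enc)
{
	if (needs_conversion(local_enc, commit_enc))
		return commit_enc;                    /* step 3B */
	return local_enc ? local_enc : "binary";      /* step 3A */
}
```

A real implementation would probably also have to normalize encoding names (for example "utf8" versus "UTF-8") before comparing them, and handle the "UTF-8 NFC" style of value separately.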


When a file is checked out, the following things will happen:

1. Git reads the actual filename encoding from the repository. If it is not specified, "binary" will be assumed.

2. Git detects the local filename encoding, the same way as before. If the actual filename encoding equals the local filename encoding, or if either of them is "binary", go to step 3A. Otherwise go to step 3B.

3A. Git creates the file using exactly the same bytes for the filename as are stored in the repository. This is the same as how Git works now.

3B. Git converts the filename from the actual filename encoding to the local filename encoding, and creates the file using the encoding of the local platform.
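The checkout side could look something like the sketch below. It assumes the POSIX iconv() interface as the conversion library (one possible choice, nothing is decided here), and the function names are hypothetical. The caller owns the returned string.

```c
#include <iconv.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: an unset (NULL) encoding is treated like "binary". */
static bool is_binary(const char *enc)
{
	return enc == NULL || strcmp(enc, "binary") == 0;
}

/* Copy the filename bytes unchanged (step 3A). */
static char *copy_filename(const char *name)
{
	size_t len = strlen(name) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, name, len);
	return copy;
}

/* Convert with iconv (step 3B). Returns NULL if the conversion fails. */
static char *convert_filename(const char *name, const char *from_enc,
			      const char *to_enc)
{
	size_t in_left = strlen(name);
	size_t out_size = in_left * 4 + 1;  /* assumed generous upper bound */
	size_t out_left = out_size - 1;     /* keep room for the final NUL */
	char *in_ptr = (char *)name;        /* iconv() takes a non-const pointer */
	char *out, *out_ptr;
	size_t rc;
	iconv_t cd = iconv_open(to_enc, from_enc);

	if (cd == (iconv_t)-1)
		return NULL;
	out = out_ptr = malloc(out_size);
	if (!out) {
		iconv_close(cd);
		return NULL;
	}
	rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
	if (rc != (size_t)-1)   /* flush any pending shift state */
		rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);
	iconv_close(cd);
	if (rc == (size_t)-1) {
		free(out);
		return NULL;
	}
	*out_ptr = '\0';
	return out;
}

/* Pick step 3A or 3B for a filename read from the repository. */
static char *checkout_filename(const char *stored, const char *stored_enc,
			       const char *local_enc)
{
	if (is_binary(stored_enc) || is_binary(local_enc) ||
	    strcmp(stored_enc, local_enc) == 0)
		return copy_filename(stored);
	return convert_filename(stored, stored_enc, local_enc);
}
```

What Git should do when the conversion fails (for example when a character has no representation in the local encoding) is left open in this sketch; it just returns NULL.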


This should fit in with Git's philosophy of not modifying the user's data without the user's permission. The data will always be stored unchanged in the repository, unless the user specifies "i18n.commitFilenameEncoding". By default, conversions are done only on checkout. Git will try to serve the needs of the user as well as it can by detecting the local filename encoding, but if the user so desires, he can disable the conversions by setting "i18n.localFilenameEncoding" to "binary", in which case Git will work the same way as it does today.


I was browsing Git's code, and it seems that the encoding information would need to be added to struct name_entry in tree-walk.h. A quick search reveals that name_entry is used in 15 files, out of which only 4 files use it more than once. It would probably make sense to create a new datatype for the filename, for example "struct encoded_path { const char *path; const char *encoding; }", and then provide functions for accessing the filename with the right encoding (commit or local).
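To make the idea a bit more concrete, here is a minimal sketch of such a datatype with two accessors. Only "struct encoded_path" and its two fields come from the suggestion above; the accessor names are invented for illustration, and nothing here reflects the actual layout of struct name_entry.

```c
#include <stdbool.h>
#include <string.h>

/* The datatype suggested above: filename bytes plus their encoding. */
struct encoded_path {
	const char *path;      /* filename bytes as stored in the tree */
	const char *encoding;  /* e.g. "UTF-8"; NULL if not recorded */
};

/* Hypothetical accessor: a missing encoding means "binary". */
static const char *encoded_path_encoding(const struct encoded_path *p)
{
	return p->encoding ? p->encoding : "binary";
}

/* Hypothetical accessor: is the path already in the wanted encoding? */
static bool encoded_path_has_encoding(const struct encoded_path *p,
				      const char *wanted)
{
	return strcmp(encoded_path_encoding(p), wanted) == 0;
}
```

Callers that need the name in the commit or local encoding would go through functions like these instead of touching the raw path pointer, which should keep the conversion logic in one place.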

I might even be able to make that change myself, because Git is not legacy software (it has tests) and the needed changes seem quite local. I would just need a way to detect the encodings (at first it could rely on manually set config variables) and a library for doing the encoding conversions.

One big question is whether this change will require a change to the repository format. Will it be possible to add the encoding field to the tree object without breaking compatibility with older Git clients? If compatibility needs to be broken, how can it be done in a controlled fashion?

--
Esko Luontola
www.orfjackal.net