Robin Rosenberg wrote on 13.5.2009 8:24:
> If the conclusion is that this is a way forward, then I
> could start working on a completely new set of much cleaner patches.
That would be great!
I see that in those early patches you took the approach of converting
the filenames from the local encoding to UTF-8 at the outer edges of
Git. That was obviously the easiest way to do it with minimal
changes to Git.
I've been thinking about a somewhat more extensive approach, which
should serve the interests of all stakeholders:
Currently the tree object contains the following information for each
file: filename, mode, sha1. To that would be added one more string:
filename encoding. If no encoding is specified (such as in old commits
made before the encoding information was added), the encoding defaults
to "binary", which matches how Git works now (it treats filenames as a
series of bytes, ignoring their encoding completely).
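To make the idea concrete, here is a rough sketch of such a per-file
record and its default. The names tree_file_entry and
entry_filename_encoding are made up for illustration; this is not
Git's actual tree entry layout, only the shape of the proposed data.

```c
#include <stddef.h>

/* Hypothetical per-file record. Today Git stores only the filename,
 * mode and sha1; filename_encoding is the proposed addition. */
struct tree_file_entry {
	const char *filename;          /* raw bytes, exactly as stored */
	unsigned int mode;
	unsigned char sha1[20];
	const char *filename_encoding; /* NULL if the entry predates the field */
};

/* Return the encoding to assume for an entry. Entries without the
 * field fall back to "binary", i.e. today's behaviour. */
static const char *entry_filename_encoding(const struct tree_file_entry *e)
{
	return e->filename_encoding ? e->filename_encoding : "binary";
}
```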
When a file is added/committed, the following things will happen:
1. Git determines the filename encoding used by the system. Git
will try to detect it automatically from the environment, and the
autodetected value can be overridden by setting a config variable
"i18n.localFilenameEncoding". If autodetection fails, the encoding
defaults to "binary".
2. Git reads the config variable "i18n.commitFilenameEncoding". If
localFilenameEncoding equals commitFilenameEncoding, or if either of
them is "binary", go to step 3A. Otherwise go to step 3B.
3A. Git saves the filename together with the local filename encoding.
The bytes of the filename are not changed when it is stored in the
repository (the same as now).
3B. Git converts the filename from localFilenameEncoding to
commitFilenameEncoding. (The commitFilenameEncoding may also specify a
normalized form for UTF-8, for example "UTF-8 NFC". This is needed for
Mac OS X.) Then Git saves the filename together with the commit filename
encoding.
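The choice between steps 3A and 3B when adding a file could be
sketched roughly as follows. The names add_plan and plan_add are
hypothetical, encoding names are compared with a plain strcmp (a real
implementation would probably want to handle aliases and case), and an
unset commitFilenameEncoding is assumed to behave like "binary".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct add_plan {
	bool convert;         /* true for step 3B, false for step 3A */
	const char *store_as; /* encoding label saved with the filename */
};

static bool is_binary(const char *enc)
{
	return strcmp(enc, "binary") == 0;
}

/* local_enc: detected or configured local filename encoding.
 * commit_enc: i18n.commitFilenameEncoding, or NULL if unset
 * (treated as "binary" here, which is an assumption). */
static struct add_plan plan_add(const char *local_enc, const char *commit_enc)
{
	struct add_plan plan;

	if (!commit_enc)
		commit_enc = "binary";

	if (is_binary(local_enc) || is_binary(commit_enc) ||
	    strcmp(local_enc, commit_enc) == 0) {
		/* Step 3A: store the bytes unchanged, labelled as local. */
		plan.convert = false;
		plan.store_as = local_enc;
	} else {
		/* Step 3B: convert, then label with the commit encoding. */
		plan.convert = true;
		plan.store_as = commit_enc;
	}
	return plan;
}
```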
When a file is checked out, the following things will happen:
1. Git reads the actual filename encoding from the repository. If it is
not specified, "binary" will be assumed.
2. Git detects the local filename encoding, the same way as before. If
the actual filename encoding equals the local filename encoding, or if
either of them is "binary", go to step 3A. Otherwise go to step 3B.
3A. Git creates the file using exactly the bytes stored in the
repository as its filename. This is the same as how Git works now.
3B. Git converts the filename from the actual filename encoding to the
local filename encoding, and creates the file using the encoding of the
local platform.
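On the checkout side the test is the mirror image. A minimal sketch,
where checkout_needs_conversion is a made-up name and a missing
encoding in the repository is taken to mean "binary" as in step 1:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* stored_enc: encoding recorded in the repository, or NULL if none.
 * local_enc: detected or configured local filename encoding.
 * Returns true for step 3B (convert), false for step 3A (raw bytes). */
static bool checkout_needs_conversion(const char *stored_enc,
				      const char *local_enc)
{
	if (!stored_enc)
		stored_enc = "binary";

	if (strcmp(stored_enc, "binary") == 0 ||
	    strcmp(local_enc, "binary") == 0)
		return false;

	return strcmp(stored_enc, local_enc) != 0;
}
```

With this rule, old repositories without any encoding information
never trigger a conversion, whatever the local encoding is.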
This should fit in with Git's philosophy of not modifying the user's
data without the user's permission. The data will always be stored
unchanged in the repository, unless the user specifies
"i18n.commitFilenameEncoding". By default, conversions are done only
on checkout. Git will try to serve the needs of the user as well as it
can by detecting the local filename encoding, but users who do not
want that can disable the conversions by setting
"i18n.localFilenameEncoding" to "binary", in which case Git will work
the same way as it does today.
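One possible way to do the detection on POSIX systems is to ask the
locale for its codeset. This is only a sketch: the function name
local_filename_encoding is made up, it assumes the locale's codeset is
also what the filesystem uses for filenames (which POSIX does not
guarantee), and other platforms such as Windows would need a different
mechanism.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* config_override: value of i18n.localFilenameEncoding, or NULL if
 * the variable is not set. Setting it to "binary" disables conversion.
 * Note: the autodetected result points to storage owned by the C
 * library and may be overwritten by later locale calls. */
static const char *local_filename_encoding(const char *config_override)
{
	const char *codeset;

	if (config_override && *config_override)
		return config_override;

	/* Adopt the locale from the environment (LANG, LC_ALL, ...). */
	if (!setlocale(LC_CTYPE, ""))
		return "binary";

	codeset = nl_langinfo(CODESET);
	return (codeset && *codeset) ? codeset : "binary";
}
```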
I was browsing Git's code, and it seems that the encoding information
would need to be added to struct name_entry in tree-walk.h. A quick
search reveals that name_entry is used in 15 files, of which only 4
use it more than once. It would probably make sense to create a
new datatype for the filename, for example "struct encoded_path { const
char *path; const char *encoding; }", and then provide functions for
accessing the filename with the right encoding (commit or local).
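As a sketch of what such an accessor might look like (encoded_path_as
and path_convert_fn are hypothetical names, and the converter is left
pluggable so that the struct does not depend on any particular
conversion library):

```c
#include <stddef.h>
#include <string.h>

/* Filename bytes plus the name of their encoding. */
struct encoded_path {
	const char *path;
	const char *encoding; /* NULL is treated as "binary" (assumption) */
};

/* Hypothetical converter signature: writes a NUL-terminated result
 * into out and returns 0 on success, -1 on failure. */
typedef int (*path_convert_fn)(const char *from, const char *to,
			       const char *in, char *out, size_t outlen);

/* Return the path in the wanted encoding (commit or local). The stored
 * bytes are returned untouched when either side is "binary" or the
 * encodings match; otherwise the converter fills out. Returns NULL if
 * the conversion fails. */
static const char *encoded_path_as(const struct encoded_path *ep,
				   const char *wanted,
				   path_convert_fn convert,
				   char *out, size_t outlen)
{
	const char *have = ep->encoding ? ep->encoding : "binary";

	if (strcmp(have, "binary") == 0 || strcmp(wanted, "binary") == 0 ||
	    strcmp(have, wanted) == 0)
		return ep->path;

	return convert(have, wanted, ep->path, out, outlen) == 0 ? out : NULL;
}
```

Callers would then never touch the raw path field directly, which
keeps the conversion rule in one place.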
I might even be able to make that change myself, because Git is not
legacy software (it has tests) and the needed changes seem quite local.
I would just need a way to detect the encodings (at first it could rely
on manually set config variables) and have a library for doing the
encoding conversions.
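For the conversion library, POSIX iconv would be one candidate. Below
is a minimal sketch of converting a single filename with it; the
function name convert_filename is made up, and Unicode normalization
(the "UTF-8 NFC" case) is not covered, since iconv only converts
between character encodings.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated filename from one encoding to another.
 * The result is written NUL-terminated into out (outlen bytes).
 * Returns 0 on success, -1 on any failure (unknown encoding,
 * invalid input sequence, or output buffer too small). */
static int convert_filename(const char *from, const char *to,
			    const char *in, char *out, size_t outlen)
{
	iconv_t cd;
	char *inp = (char *)in; /* iconv() takes char **, input is not modified */
	size_t inleft = strlen(in);
	char *outp = out;
	size_t outleft;
	size_t rc;

	if (outlen == 0)
		return -1;
	outleft = outlen - 1; /* reserve room for the terminating NUL */

	cd = iconv_open(to, from);
	if (cd == (iconv_t)-1)
		return -1;

	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	if (rc != (size_t)-1)
		/* Flush any pending shift state for stateful encodings. */
		rc = iconv(cd, NULL, NULL, &outp, &outleft);
	iconv_close(cd);

	if (rc == (size_t)-1)
		return -1;

	*outp = '\0';
	return 0;
}
```

On some systems (Mac OS X, for example) iconv lives in a separate
library and needs -liconv at link time.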
One big question is whether this change will require a change to the
repository format. Will it be possible to add the encoding field to the
tree object without breaking compatibility with older Git clients? If
compatibility needs to be broken, how can it be done in a controlled
fashion?
--
Esko Luontola
www.orfjackal.net