Some projects may want to enforce that a particular encoding is used
for all filenames in the repository. For UTF-8, Unicode defines four
normal forms (see http://unicode.org/reports/tr15/), any of which may
be a reasonable repository format choice. Additionally, some
filesystems may support only a single encoding when writing local
filenames.

To support this, iconv and a normalization library must have the
information they need to perform the correct conversion.

This is a configuration design proposal, and does not implement any
changes.
---
Hi all,

I think that restating the problem in these terms might be more
productive than the previous discussion. Design critiques?

The intent is that this has no impact at all on users with C
filesystems unless they configure it explicitly, while adding the
ability for projects to specify Unicode normalisation (so that, e.g.,
Märchen ends up the same as Märchen, whether the ä is stored
precomposed or as an a plus a combining diaeresis).

[apologies if this hits the list twice; I sent the first with a bad
content encoding header and assume it got dropped]

 Documentation/config.txt        |   16 ++++++++++++++++
 Documentation/gitattributes.txt |   19 +++++++++++++++++++
 Documentation/i18n.txt          |    9 ++++++---
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ee08845..9d2567d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -146,6 +146,22 @@ core.symlinks::
 	file. Useful on filesystems like FAT that do not support symbolic
 	links. True by default.
 
+core.repositoryPathEncoding::
+	Specify the default assumed encoding of repository paths, if
+	not specified in gitlink:gitattributes[3] for that repository.
+	The default value of this is "C".
+
+core.checkoutPathEncoding::
+	Specify the encoding of local filenames. The default value of
+	this depends on the platform and filesystem, but for most users
+	will be "C", indicating no pathname conversion required.
+
+core.checkoutPathEncodingFromLocale::
+	Specify whether the checkout path encoding should be
+	controlled via environment locale variables. This may have
+	some bizarre side effects if you switch locales between
+	working with a checkout. False by default.
+
 core.gitProxy::
 	A "proxy command" to execute (as 'command host port') instead
 	of establishing direct connection to the remote server when
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index cc9c7c5..4136528 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -170,6 +170,25 @@ intent is that if someone unsets the filter driver definition,
 or does not have the appropriate filter program, the project
 should still be usable.
 
+`encoding`
+^^^^^^^^^^
+Specifies the valid encoding for file names (does not affect content)
+on the specified path. Git enforces that all filenames are valid in
+this encoding, and if applicable and possible, will translate from the
+encoding configured (or, on relevant platform and filesystem
+combinations, detected) to this encoding.
+
+The default value of this is "C", which leaves behaviour on
+filesystems which do not support "C" semantics undefined until it is
+set. For instance, if your filesystem supports only UTF-8, and you
+are trying to check out a repository that is in Latin-1, then you will
+need to configure the repository encoding in `.git/info/attributes`
+before you can check files out on that system.
+
+Valid encodings are currently 'ISO-8859-1' and 'UTF-8'. 'UTF-8' may
+be followed by '+NFC', '+NFD', '+NFKD' or '+NFKC' to enforce a
+particular normalization of filenames.
+
 
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index b95f99b..fba0407 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,11 +1,14 @@
 At the core level, git is character encoding agnostic.
 
  - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
+   are normally treated as uninterpreted sequences of non-NUL bytes.
    What readdir(2) returns are what are recorded and compared
    with the data git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts. There is no such
-   thing as pathname encoding translation.
+   to be what lstat(2) and creat(2) accepts.
+
+However, if there are configured encodings for the checkout and/or
+repository, then the defined conversions will occur between the
+readdir(2) and the index, in both directions.
 
  - The contents of the blob objects are uninterpreted sequence
    of bytes. There is no encoding translation at the core
-- 
1.5.3.5
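
For illustration only (this is not part of the patch, which changes
documentation and implements nothing): a minimal Python sketch of the
kind of conversion the proposed settings describe, going from a
checkout path encoding to a repository path encoding. The helper
names parse_encoding() and convert_path() are hypothetical. Treating
"C" on either side as a byte-for-byte pass-through is an assumption;
the proposal itself leaves some "C" combinations undefined. Python's
codecs and unicodedata stand in for iconv and a normalization library.

```python
import unicodedata

NORMAL_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def parse_encoding(spec):
    """Split a value such as 'UTF-8+NFC' into (charset, form or None).

    Hypothetical helper; the '+FORM' suffix syntax follows the
    proposed gitattributes text.
    """
    charset, _, form = spec.partition("+")
    if form:
        if charset.upper() != "UTF-8":
            raise ValueError("normal forms only apply to UTF-8: %r" % spec)
        if form not in NORMAL_FORMS:
            raise ValueError("unknown normal form: %r" % form)
    return charset, (form or None)


def convert_path(raw, src_spec, dst_spec):
    """Convert a path given as bytes from src_spec to dst_spec.

    Assumption: "C" on either side means uninterpreted bytes, so the
    path is returned unchanged.
    """
    if src_spec == "C" or dst_spec == "C":
        return raw
    src_charset, _ = parse_encoding(src_spec)
    dst_charset, dst_form = parse_encoding(dst_spec)
    # Decoding fails with UnicodeDecodeError if the name is not valid
    # in the source encoding, which is where enforcement would happen.
    text = raw.decode(src_charset)
    if dst_form is not None:
        text = unicodedata.normalize(dst_form, text)
    return text.encode(dst_charset)
```

With this sketch, the decomposed UTF-8 bytes for Märchen and the
Latin-1 bytes for Märchen both convert to the same precomposed UTF-8
name when the repository side is 'UTF-8+NFC'.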