[PATCH] [RFC] Design for pathname encoding gitattribute [RESEND]

Some projects may want to enforce that a particular encoding is used
for all filenames in the repository.  Unicode defines four
normalization forms (see http://unicode.org/reports/tr15/), any of
which may be a reasonable repository format choice for UTF-8
filenames.  Additionally, some filesystems support only a single
encoding when writing local filenames.  To support this, iconv and a
normalization library must have the information they need to perform
the correct conversion.

This is a configuration design proposal, and does not implement any
changes.
---
   Hi all, I think that restating the problem in these terms might be
   more productive than the previous discussion.  Design critiques?

   The intent is that this has no impact at all on users with "C"
   filesystems unless they configure it explicitly, while allowing
   projects to specify Unicode normalisation (so that, e.g., Märchen
   spelled with a precomposed ä ends up the same as Märchen spelled
   with "a" plus a combining diaeresis)
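
   To make the normalisation point concrete, here is a small sketch
   using Python's standard unicodedata module.  It is only an
   illustration of the Unicode behaviour, not part of the proposal,
   and the variable names are made up for the example.

```python
import unicodedata

# Two spellings of the same visible name: precomposed U+00E4 versus
# "a" followed by U+0308 COMBINING DIAERESIS.
composed = "M\u00e4rchen"
decomposed = "Ma\u0308rchen"

# The raw strings (and therefore their UTF-8 byte sequences) differ ...
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# ... but they agree once both are brought to the same normal form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = unicodedata.normalize(form, composed) == unicodedata.normalize(form, decomposed)
    assert same, form
```

   Without a repository-wide choice of form, these two spellings would
   be recorded as two distinct paths.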

   [apologies if this hits the list twice; I sent the first with a bad
    content encoding header and assume it got dropped]
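
   For discussion, the conversion between a checkout encoding and a
   repository encoding could look roughly like the Python sketch
   below.  The function names parse_encoding and to_repository_path
   are hypothetical, and treating "C" on either side as "pass the
   bytes through untouched" is my assumption rather than something
   the patch text settles.

```python
import unicodedata

VALID_ENCODINGS = {"ISO-8859-1", "UTF-8"}  # as listed in the proposed docs
VALID_FORMS = {"NFC", "NFD", "NFKC", "NFKD"}


def parse_encoding(value):
    """Split e.g. 'UTF-8+NFC' into ('UTF-8', 'NFC'); the form may be None."""
    name, _, form = value.partition("+")
    if name == "C":
        if form:
            raise ValueError("normalization only applies to UTF-8")
        return "C", None
    if name not in VALID_ENCODINGS:
        raise ValueError("unsupported encoding: " + name)
    if form and (name != "UTF-8" or form not in VALID_FORMS):
        raise ValueError("unsupported normalization: " + value)
    return name, form or None


def to_repository_path(raw, checkout, repository):
    """Convert a filename (bytes, as from readdir) to the repository encoding."""
    src, _ = parse_encoding(checkout)
    dst, form = parse_encoding(repository)
    if "C" in (src, dst):
        return raw  # assumption: "C" means uninterpreted bytes
    text = raw.decode(src)  # raises UnicodeDecodeError if invalid in src
    if form:
        text = unicodedata.normalize(form, text)
    return text.encode(dst)  # raises UnicodeEncodeError if not representable
```

   For example, a Latin-1 checkout name b"M\xe4rchen" converted for a
   "UTF-8+NFC" repository would come back as the UTF-8 bytes of the
   precomposed spelling.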

 Documentation/config.txt        |   16 ++++++++++++++++
 Documentation/gitattributes.txt |   19 +++++++++++++++++++
 Documentation/i18n.txt          |    9 ++++++---
 3 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ee08845..9d2567d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -146,6 +146,22 @@ core.symlinks::
 	file. Useful on filesystems like FAT that do not support
 	symbolic links. True by default.
 
+core.repositoryPathEncoding::
+	Specify the default assumed encoding of repository paths, if
+	not specified in gitlink:gitattributes[5] for that repository.
+	The default value of this is "C".
+
+core.checkoutPathEncoding::
+	Specify the encoding of local filenames.  The default value
+	depends on the platform and filesystem, but for most users it
+	will be "C", meaning that no pathname conversion is required.
+
+core.checkoutPathEncodingFromLocale::
+	Specify whether the checkout path encoding should be
+	controlled via environment locale variables.  This may have
+	some bizarre side effects if you switch locales while working
+	with a checkout.  False by default.
+
 core.gitProxy::
 	A "proxy command" to execute (as 'command host port') instead
 	of establishing direct connection to the remote server when
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index cc9c7c5..4136528 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -170,6 +170,25 @@ intent is that if someone unsets the filter driver definition,
 or does not have the appropriate filter program, the project
 should still be usable.
 
+`encoding`
+^^^^^^^^^^
+Specifies the valid encoding for file names (does not affect content)
+on the specified path.  Git enforces that all filenames are valid in
+this encoding, and if applicable and possible, will translate from the
+encoding configured (or, on relevant platform and filesystem
+combinations, detected) to this encoding.
+
+The default value of this is "C", which leaves behaviour on
+filesystems which do not support "C" semantics undefined until it is
+set.  For instance, if your filesystem supports only UTF-8, and you
+are trying to check out a repository that is in Latin-1, then you will
+need to configure the repository encoding in `.git/info/attributes`
+before you can check files out on that system.
+
+Valid encodings are currently 'ISO-8859-1' and 'UTF-8'.  'UTF-8' may
+be followed by '+NFC', '+NFD', '+NFKD' or '+NFKC' to enforce a
+particular normalization of filenames.
+
 
 Interaction between checkin/checkout attributes
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index b95f99b..fba0407 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,11 +1,14 @@
 At the core level, git is character encoding agnostic.
 
  - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
+   are normally treated as uninterpreted sequences of non-NUL bytes.
    What readdir(2) returns are what are recorded and compared
    with the data git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts.  There is no such
-   thing as pathname encoding translation.
+   to be what lstat(2) and creat(2) accepts.
+
+However, if encodings are configured for the checkout and/or
+repository, then the defined conversions will occur between what
+readdir(2) returns and the index, in both directions.
 
  - The contents of the blob objects are uninterpreted sequence
    of bytes.  There is no encoding translation at the core
-- 
1.5.3.5
