[PATCH] Documentation/i18n.txt: clarify character encoding support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



As a "distributed" VCS, git should better define the encodings of its core
textual data structures, in particular those that are part of the network
protocol.

That git is encoding agnostic is only really true for blob objects. E.g.
the 'non-NUL bytes' requirement of tree and commit objects excludes
UTF-16/32, and the special meaning of '/' in the index file as well as
space and linefeed in commit objects eliminates EBCDIC and other non-ASCII
encodings.

Git expects bytes < 0x80 to be pure ASCII, thus CJK encodings that partly
overlap with the ASCII range are problematic as well. E.g. fmt_ident()
removes trailing 0x5C from user names on the assumption that it is ASCII
'\'. However, there are over 200 GBK double byte codes that end in 0x5C.

UTF-8 as default encoding on Linux and respective path translations in the
Mac and Windows versions have established UTF-8 NFC as de-facto standard
for path names.

Update the documentation in i18n.txt to reflect the current status-quo.

Signed-off-by: Karsten Blees <blees@xxxxxxx>
---
 Documentation/i18n.txt | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
index e9a1d5d..e5f6233 100644
--- a/Documentation/i18n.txt
+++ b/Documentation/i18n.txt
@@ -1,18 +1,28 @@
-At the core level, Git is character encoding agnostic.
-
- - The pathnames recorded in the index and in the tree objects
-   are treated as uninterpreted sequences of non-NUL bytes.
-   What readdir(2) returns are what are recorded and compared
-   with the data Git keeps track of, which in turn are expected
-   to be what lstat(2) and creat(2) accepts.  There is no such
-   thing as pathname encoding translation.
+Git is to some extent character encoding agnostic.
 
  - The contents of the blob objects are uninterpreted sequences
    of bytes.  There is no encoding translation at the core
    level.
 
- - The commit log messages are uninterpreted sequences of non-NUL
-   bytes.
+ - Pathnames are encoded in UTF-8 normalization form C. This
+   applies to tree objects, the index file, ref names and
+   config files (`.git/config` (see linkgit:git-config[1]),
+   linkgit:gitignore[5], linkgit:gitattributes[5] and
+   linkgit:gitmodules[5]).
+   The Mac and Windows versions automatically translate pathnames
+   to and from UTF-8 NFC in their readdir(2), lstat(2), creat(2)
+   etc. APIs. However, there is no such translation on other
+   platforms. If file system APIs don't use UTF-8 (which may be
+   file system specific), it is recommended to stick to pure
+   ASCII file names. While Git technically supports other
+   extended ASCII encodings at the core level, such repositories
+   will not be portable.
+
+ - Commit log messages are typically encoded in UTF-8, but other
+   extended ASCII encodings are also supported. This includes
+   ISO-8859-x, CP125x and many others, but _not_ UTF-16/32,
+   EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5,
+   EUC-x, CP9xx etc.).
 
 Although we encourage that the commit log messages are encoded
 in UTF-8, both the core and Git Porcelain are designed not to
-- 
2.4.1.windows.1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]