Re: [PATCH] Documentation/i18n.txt: clarify character encoding support

On 15.06.2015 at 02:12, Junio C Hamano wrote:
> Karsten Blees <karsten.blees@xxxxxxxxx> writes:
> 
>> diff --git a/Documentation/i18n.txt b/Documentation/i18n.txt
>> index e9a1d5d..e5f6233 100644
>> --- a/Documentation/i18n.txt
>> +++ b/Documentation/i18n.txt
>> @@ -1,18 +1,28 @@
>> -At the core level, Git is character encoding agnostic.
>> -
>> - - The pathnames recorded in the index and in the tree objects
>> -   are treated as uninterpreted sequences of non-NUL bytes.
>> -   What readdir(2) returns are what are recorded and compared
>> -   with the data Git keeps track of, which in turn are expected
>> -   to be what lstat(2) and creat(2) accepts.  There is no such
>> -   thing as pathname encoding translation.
>> +Git is to some extent character encoding agnostic.
> 
> I do not think the removal of the text makes much sense here unless
> you add the equivalent to the new text below.
> 
>>   - The contents of the blob objects are uninterpreted sequences
>>     of bytes.  There is no encoding translation at the core
>>     level.
>>  
>> - - The commit log messages are uninterpreted sequences of non-NUL
>> -   bytes.
>> + - Pathnames are encoded in UTF-8 normalization form C. This
> 
> That is true only on some systems like OSX (with HFS+) and Windows,
> no?  BSDs in general and Linux do not do any such mangling IIRC.

Modern Unices don't need any such mangling because UTF-8 NFC should
be the default system encoding. I'm not sure about the BSDs, but UTF-8
has been the default on all major Linux distros for more than 10 years.
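
To make the normalization point concrete, here is a minimal sketch in
Python (not part of the patch; the file name is a made-up example). It
shows that the same visible name has different UTF-8 byte sequences in
NFC and NFD, so a byte-wise pathname comparison treats them as two
different names:

```python
import unicodedata

# Hypothetical file name, used only for illustration.
name = "Bj\u00f6rn.txt"  # 'o with diaeresis' as one precomposed code point

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form: 'o' + U+0308

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Same text to a human reader, different bytes on disk.
print(nfc_bytes)
print(nfd_bytes)
print(nfc_bytes == nfd_bytes)

# Normalizing the decomposed form back to NFC restores byte equality.
print(unicodedata.normalize("NFC", nfd).encode("utf-8") == nfc_bytes)
```

This is only an illustration of the encoding issue, not of how Git or
any particular file system implements the conversion.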

> I
> am OK with mangling described as a notable oddball to warn users,
> though; i.e. not as a norm as your new text suggests but as an
> exception.
> 

I would guess that non-UTF-8 Unices (or file systems) are the oddball
case, which is why I described them last. But I could be wrong.

>> +   platforms. If file system APIs don't use UTF-8 (which may be
>> +   file system specific), it is recommended to stick to pure
>> +   ASCII file names.
> 
> Hmph, who endorsed such a recommendation?  It is recommended to
> stick to whatever naming scheme that would not cause troubles to
> project participants.  If your participants all want to (and can)
> use ISO-8859-1, we do not discourage them from doing so.
> 

ISO-8859-x file names may be fine if you won't ever need to:
- use gitweb, JGit, gitk, git-gui...
- exchange repos with "normal" (UTF-8) Unices, Mac and Windows systems
- publish your work on a git hosting service (and expect file and
  ref names to show up correctly in the web interface)
- store the repo on Unicode-based file systems (JFS, Joliet, UDF,
  exFAT, NTFS, HFS, CIFS...)
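
The interoperability problem behind the list above can be sketched in
a few lines of Python. The helper names and the file name are
hypothetical, and the sketch assumes a tool that interprets pathname
bytes strictly as UTF-8; it does not describe what any specific tool
actually does:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the pathname bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def display_name(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes."""
    return raw.decode("utf-8", errors="replace")


# The same hypothetical name, stored by two differently configured systems.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")
utf8_name = "caf\u00e9.txt".encode("utf-8")

print(is_valid_utf8(utf8_name))
print(is_valid_utf8(latin1_name))
print(display_name(latin1_name))
```

The ISO-8859-1 bytes are not valid UTF-8, so a UTF-8-only consumer can
at best show a replacement character where the accented letter was.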

These restrictions are not that obvious when you start a new git
project, and while converting file names after the fact is possible
(e.g. using the recodetree script we shipped with Git for Windows
1.7.10), it will destroy history.

Thus I think we should strongly discourage users from using anything
but UTF-8.
