Suggested clarification for .gitattributes reference documentation

The .gitattributes documentation should be clarified to ensure that files encoded as UTF-16 are properly accounted for,
in particular for Windows users.

Specifically, within the working-tree encoding topic https://git-scm.com/docs/gitattributes#_working_tree_encoding, I suggest the following edits:


NEW BULLETED PARAGRAPH UNDER THE HEADING "Please note that using the working-tree-encoding 
attribute may have a number of pitfalls:"

    * Git for Windows is not able to access the iconv.exe text conversion program from an ordinary
      Command Prompt.  Be sure to run 'git clone' or 'git add' from a git bash console or a Git
      GUI.

OLD TEXT

    As an example, use the following attributes if your *.ps1 files are UTF-16 encoded with byte order mark (BOM) 
    and you want Git to perform automatic line ending conversion based on your platform.
    
    *.ps1       text working-tree-encoding=UTF-16
    
    Use the following attributes if your *.ps1 files are UTF-16 little endian encoded without BOM
    and you want Git to use Windows line endings in the working directory (use UTF-16LE-BOM instead 
    of UTF-16LE if you want UTF-16 little endian with BOM). Please note, it is highly recommended 
    to explicitly define the line endings with eol if the working-tree-encoding attribute is used 
    to avoid ambiguity.
    
    *.ps1      text working-tree-encoding=UTF-16LE eol=CRLF
    

NEW TEXT (SPECIFYING UTF-16BE EXPLICITLY IN THE FIRST EXAMPLE, AND WITH A NEW SEPARATE EXAMPLE FOR UTF-16LE WITH BOM)

    As an example, use the following attributes if your *.ps1 files are UTF-16 big endian encoded
    with byte order mark (BOM) and you want Git to perform automatic line ending conversion
    based on your platform.
    
    *.ps1       text working-tree-encoding=UTF-16
    
    Use the following attributes if your *.ps1 files are UTF-16 little endian encoded without BOM
    and you want Git to use Windows line endings in the working directory.
    
    *.ps1      text working-tree-encoding=UTF-16LE eol=CRLF
    
    Use the following attributes if your *.ps1 files are UTF-16 little endian encoded with BOM
    and you want Git to use Windows line endings in the working directory.
    
    *.ps1      text working-tree-encoding=UTF-16LE-BOM eol=CRLF

    Please note, it is highly recommended to explicitly define the line endings with eol 
    if the working-tree-encoding attribute is used to avoid ambiguity.
    
    Please note, Git for Windows does not support UTF-16LE encoding when running git
    commands from an ordinary Command Prompt.  Use a git bash console instead.
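
To choose between the three attribute lines above, it helps to know whether an existing file starts with a BOM. The following is a minimal Python sketch, not part of the proposed documentation text. The function name `suggest_working_tree_encoding` is hypothetical, and the mapping simply mirrors the examples above (big endian BOM to `UTF-16`, little endian BOM to `UTF-16LE-BOM`). A file without a BOM cannot be classified this way, so the sketch returns `None` for it.

```python
import codecs
from typing import Optional


def suggest_working_tree_encoding(data: bytes) -> Optional[str]:
    """Suggest a working-tree-encoding value from the leading bytes of a file.

    Hypothetical helper: returns None when no UTF-16 BOM is found, because
    BOM-less UTF-16LE cannot be told apart reliably from other encodings.
    """
    # Check UTF-32 first: the UTF-32LE BOM (FF FE 00 00) begins with the
    # UTF-16LE BOM (FF FE). A UTF-16LE file whose first character is NUL
    # would be misread here; that case is assumed not to occur.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE-BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16"
    return None


if __name__ == "__main__":
    # Build a small little endian sample with a BOM, as Windows tools often do.
    sample = codecs.BOM_UTF16_LE + "Write-Host 'hi'\r\n".encode("utf-16-le")
    print(suggest_working_tree_encoding(sample))
```

In practice the first few bytes of the real file would be read in binary mode (for example `open(path, "rb").read(4)`) and passed to the function. The result is only a hint for writing the .gitattributes entry by hand.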


OLD TEXT:
    
    You can get a list of all available encodings on your platform with the following command:
    
    iconv --list


NEW TEXT:
    
    You can get a list of all available encodings on your platform with the following command:
    
    iconv --list
    
    For Git for Windows users, the command above is only supported when run from a 'git bash' console.


In the thread "help request: unable to merge UTF-16-LE "text" file" at https://lore.kernel.org/git/Yl8uiflurfjuLIvD@xxxxxxxxxxxxxxxxxxxxxxxxx/, brian m. carlson, Chris Torek and others describe tips for dealing with improper encoding, such as the following:

    if you have already checked the file in without an appropriate
    working-tree-encoding, you should run `git add --renormalize .` and then
    commit.  You'll need to do that (or merge in a commit that does that) on
    every branch you want to work with.

    > For that to work, it is likely that you'd need to convert not just
    > the tips of two branches getting merged, but also the merge base
    > commit, so that all three trees involved in the 3-way merge are in
    > the same text encoding.

    The old merge-recursive has `-X renormalize` that I believe would
    do this for you. (I see code in merge-ort for this as well, but have no
    handy means to test it myself.)

So a NEW SECTION describing ways to deal with improper text file encoding could be added under the
working-tree-encoding topic, specifically a description of what the following two
commands can do to remedy improper encoding:

    git add --renormalize
    git merge -X renormalize


CONCLUSION:

Text files encoded as UTF-16LE with BOM are common in the Windows world, as some versions of Visual Studio use this as the default encoding for .rc or .mc files.  Solution files, project files and other Visual Studio files can also be in this format.  Other encodings are common, too, e.g. some older versions of PowerShell defaulted to UTF-16BE with BOM for new .ps1 files.  Yet users continue to experience encoding errors even when they are using the proper working-tree-encoding in their .gitattributes file.  Part of this is due to the complexity of Git and the number of different platforms it supports.

Ideally Git would automatically detect the most common UTF encodings and treat these files as diffable text files on all platforms -- without the need for entries in .gitattributes.  And it would be great if Git for Windows could handle common UTF text encodings when executed in an ordinary Command Prompt.  Until then, clarifying and enhancing the documentation for .gitattributes could go a long way in making text encoding easier for Git users.  Thanks for considering these revisions.

- Michael





