Re: git clone corrupts file.

On Mon, Aug 16, 2021 at 03:24:28PM +0000, Russell, Scott wrote:

> 1.  The files corrupted are in Unicode.   Though the .h file mentioned
>     certainly doesn't have to be Unicode, it can be ANSI, we have
>     other files that must be Unicode.  We use Unicode in quite a
>     number of our text files.

By Unicode, I'll assume you mean UTF-16, since your example below
appears to have a byte order mark at the beginning (FF FE).

Unlike UTF-8, UTF-16 is not a superset of ASCII, and thus can't be
treated as "text" by Git (e.g., the line ending is no longer just the
single byte hex "0A", but the two bytes "0A 00" in little-endian
UTF-16, or "00 0A" in big-endian).
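
To make the byte-level problem concrete, here is a small Python sketch.
It is not what Git runs internally; the helper naive_lf_to_crlf is made
up for illustration and only mimics a byte-wise LF-to-CRLF conversion of
the kind that is safe for ASCII-compatible encodings:

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Byte-wise LF -> CRLF; fine for ASCII-superset encodings only."""
    # Normalize first so existing CRLF pairs are not doubled.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


text = "a\nb\n"
utf8 = text.encode("utf-8")        # 61 0a 62 0a
utf16 = text.encode("utf-16-le")   # 61 00 0a 00 62 00 0a 00

utf8_out = naive_lf_to_crlf(utf8)
utf16_out = naive_lf_to_crlf(utf16)

# UTF-8 survives: the result decodes to the same text with CRLF endings.
print(utf8_out.decode("utf-8") == "a\r\nb\r\n")

# UTF-16 does not: a lone 0D byte is spliced in before each 0A, which
# shifts every following 16-bit code unit by one byte.
print(utf16_out.hex(" "))
print(utf16_out.decode("utf-16-le") == "a\r\nb\r\n")
```

The inserted 0D is a single byte, whereas a UTF-16 carriage return
would need two (0D 00), so everything after the first line ending is
misaligned.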

>           f.    Entries in .gitattributes specified by type are specified for the affected files. 
>                         *.h     text eol=crlf
>                         *.ini   text eol=crlf

So this is your problem. The "text" attribute is telling Git to treat
the file as text (which will handle any ASCII-superset encoding like
UTF-8, ISO8859-1, etc, but not others like UTF-16, UTF-32, EUC-JP, etc).

Depending on what's in your repo and what you want to have happen,
you'll want to:

  - remove that attribute, if all of your ".h" files are UTF-16

  - if only some are UTF-16, you'll need to provide patterns that
    distinguish between the two types by giving them different
    attributes (e.g., "-text" should override for specific files)

  - you can stop there if you don't need line-ending conversion for
    UTF-16 files (and there may be little point; Git will treat them as
    binary for the purposes of diffing, so matching the canonical
    in-repo endings gains you little)

  - if you do want to do line ending conversion (or any other
    modifications on them), you can do so with a custom clean/smudge
    filter (see the "filter" attribute in "git help attributes")

> I would like git to observe the autocrlf false as directed.

Hopefully the above explains it, but just to be clear, this isn't
autocrlf kicking in, but rather the "text" and "eol" attributes you've
specified.

> We can't convert the files to other encoding for convenience of git.

If you're happy enough not being able to get meaningful text diffs for
these files from Git, then the above should make your problem go away.

But an alternative workflow, if you really want UTF-16 in the working
tree, is to convert between UTF-8 and UTF-16 as the files go in and out
of the working tree. There's no built-in support for that, but you could
do it with a custom clean/smudge filter. That would let Git store UTF-8
internally, do diffs, etc.
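
As a rough sketch of what such a filter could look like, assuming the
working-tree files are UTF-16 with a BOM and should be written back out
as little-endian: the script name utf16_filter.py, the driver name
"utf16" and the paths are all made up for illustration, and you would
want to test it on a scratch clone first.

```python
#!/usr/bin/env python3
# Hypothetical utf16_filter.py. Assumed wiring (names are illustrative):
#   .gitattributes:  *.h filter=utf16
#   git config filter.utf16.clean  "python3 /path/to/utf16_filter.py clean"
#   git config filter.utf16.smudge "python3 /path/to/utf16_filter.py smudge"
import codecs
import sys


def clean(data: bytes) -> bytes:
    """Working tree -> repository: UTF-16 (with BOM) to UTF-8."""
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    return data.decode("utf-16").encode("utf-8")


def smudge(data: bytes) -> bytes:
    """Repository -> working tree: UTF-8 to UTF-16LE with an FF FE BOM."""
    if not data:
        return b""  # keep empty files empty
    return codecs.BOM_UTF16_LE + data.decode("utf-8").encode("utf-16-le")


def main(mode: str) -> None:
    convert = {"clean": clean, "smudge": smudge}[mode]
    # Git feeds the file content on stdin and reads the result from stdout.
    sys.stdout.buffer.write(convert(sys.stdin.buffer.read()))


# Only act as a filter when a mode argument is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Note that this always smudges to little-endian with a BOM, so a file
that was originally big-endian or BOM-less would not round-trip byte
for byte.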

One lighter alternative to that is to actually store UTF-16 in the
repository as you are now, but provide a textconv filter (see diff
attributes in "git help attributes") to convert it to UTF-8 on the fly
when showing a diff. You won't be able to apply such a diff, but it's
useful for human eyes.
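
A textconv helper can be quite small. Again only a sketch under
assumptions: the script name utf16_textconv.py and the diff driver name
"utf16" are invented here, and it assumes the files carry a BOM so the
byte order can be detected.

```python
#!/usr/bin/env python3
# Hypothetical utf16_textconv.py. Assumed wiring (names are illustrative):
#   .gitattributes:  *.h diff=utf16
#   git config diff.utf16.textconv "python3 /path/to/utf16_textconv.py"
import sys


def to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 (byte order taken from the BOM) and re-encode as UTF-8."""
    # errors="replace" keeps the diff readable even if a file is damaged.
    return data.decode("utf-16", errors="replace").encode("utf-8")


# Git invokes the textconv program with the path of a file to convert
# and shows whatever the program writes to stdout.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as f:
        sys.stdout.buffer.write(to_utf8(f.read()))
```

Since the conversion is one-way and only used for display, the stored
blobs stay UTF-16 and nothing in the working tree changes.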

-Peff


