Re: [PATCH resend] Do not create commits whose message contains NUL

On Sun, Jan 01, 2012 at 11:27:31AM -0500, Drew Northup wrote:

> I had already started experimenting with automatically detecting decent
> UTF-16 a long while back so that compatible platforms could handle it
> appropriately in terms of creating diffs and dealing with newline
> munging between platforms. There is no 100% sure-fire check for UTF-16
> if you don't already suspect it is possibly UTF-16. If we really want to
> check for possible UTF-16 specifically I can scrape out the check I
> wrote up and send it along.

I also looked into this recently. You can generally detect UTF-16 by the
BOM at the beginning of the file (which will also tell you the
endian-ness). I did a simple test by integrating it into the check for
binary-ness during diffs. However, as I recall, the result wasn't
particularly useful. Some of the diff code wasn't happy with the
embedded NUL bytes (i.e., there is code that assumes that NUL is the end
of a string). Not to mention that ascii newline (0x0a) can appear as
part of other characters in a wide encoding like utf-16. And since git
outputs straight ascii for all of the diff boilerplate, you end up with
a mish-mash of utf-16 and ascii (this is OK with utf-8, of course,
because utf-8 is a superset of ascii).
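
To make those two problems concrete, here is a minimal Python sketch (not
git's code; the helper name utf16_bom is made up for illustration). It
only looks for a UTF-16 BOM, then shows that NUL and 0x0a bytes turn up
in UTF-16 data even when the text contains neither a NUL character nor a
newline:

```python
import codecs


def utf16_bom(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based only on a leading BOM.

    This is a heuristic: files without a BOM are not detected, and the
    UTF-32-LE BOM (FF FE 00 00) also begins with FF FE.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# Plain ASCII text picks up NUL bytes once encoded as UTF-16.
ascii_le = "a".encode("utf-16-le")   # bytes 0x61 0x00
has_nul = b"\x00" in ascii_le

# U+010A contains the byte 0x0a in either byte order, although the text
# has no newline in it.
dotted_c = "\u010a"
newline_in_le = b"\n" in dotted_c.encode("utf-16-le")  # bytes 0x0a 0x01
newline_in_be = b"\n" in dotted_c.encode("utf-16-be")  # bytes 0x01 0x0a
```

Any code that splits on 0x0a or stops at NUL will therefore cut such
data in the middle of a character.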

If anything, I think you would want to do something like "textconv" to
convert the utf-16 into utf-8, then diff that. Git won't do it
automatically based on encoding, but if you know the filenames of the
utf-16 files in your repository, you can do something like:

  echo 'foo.txt diff=utf16' >.gitattributes
  git config diff.utf16.textconv 'iconv -f utf16 -t utf8'

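and get readable diffs. Of course you couldn't use such a diff to apply
a patch, though, since it describes the converted utf-8 text rather than
the utf-16 bytes that are actually stored.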
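
If iconv isn't available, the same conversion can be sketched in Python.
This is only an illustration of what the textconv filter does, not
something git ships; the function and script names are hypothetical. Git
runs the textconv command with the file name as its single argument and
reads the converted text from stdout:

```python
import sys


def utf16_to_utf8(data: bytes) -> bytes:
    # The "utf-16" codec uses a leading BOM to pick the byte order and
    # drops it; without a BOM it falls back to the platform's native order.
    return data.decode("utf-16").encode("utf-8")


def main(argv):
    # argv[1] is the file git asks us to convert.
    with open(argv[1], "rb") as f:
        sys.stdout.buffer.write(utf16_to_utf8(f.read()))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Saved as, say, utf16-to-utf8.py, it would be hooked up with something
like:

  git config diff.utf16.textconv 'python3 /path/to/utf16-to-utf8.py'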

I strongly suspect that not many people are really using git for utf-16
files. Git treats them as binary, which makes them unpleasant for
anything except simple storage.

> The is_utf8 check was not written to detect 100% valid UTF-8 per-se. It
> seems to me that it was written as part of the "is this a binary or not"
> check in the add/commit path.

We shouldn't care about binary file content at all in the add or commit
code paths. I would guess we do only if you are using auto-crlf (but
then, I don't think we care about utf8 in that case, only whether line
endings should be converted or not).

We do check that the commit message itself is utf8, but only to generate
a warning that you should set i18n.commitEncoding.
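
As a rough illustration of why a utf8 check alone doesn't cover the NUL
case from the subject line, here is a Python sketch. It is not git's
is_utf8 (which is C and may be looser than Python's strict decoder), and
check_commit_message is a made-up name:

```python
def check_commit_message(msg: bytes) -> list:
    """Return a list of warnings for a raw commit message."""
    warnings = []
    try:
        msg.decode("utf-8")  # strict decoding; raises on invalid sequences
    except UnicodeDecodeError:
        warnings.append(
            "commit message is not valid UTF-8; "
            "consider setting i18n.commitEncoding"
        )
    # A NUL byte is valid UTF-8 (it encodes U+0000), so it has to be
    # checked for separately.
    if b"\x00" in msg:
        warnings.append("commit message contains a NUL byte")
    return warnings
```

A BOM-less utf-16 message made of ASCII characters passes the utf8 test
and trips only the NUL check, while a latin-1 message trips only the
utf8 check.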

-Peff