Re: [PATCH 00/22] Refactor to accept NUL in commit messages

Jeff King <peff@xxxxxxxx> · Thu, 27 Oct 2011 11:13:03 -0700

On Tue, Oct 25, 2011 at 07:07:38AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > I mean, besides the obvious that UTF-16 is ...
> 
> Yes, you could, besides the obvious. But that obvious reason makes it
> sufficiently different that it may not be so outrageous to draw the line
> between it and all the others.

Yeah, and I'm OK with that. It's just not a satisfying answer to give
Windows people who think UTF-16 is a good idea. But at the very least,
it's still unicode. It should be lossless for them to convert to utf8
and back if they want.

Speaking of which, I've been looking at handling diffing of utf-16
files. Right now we generally just consider them binary, which sucks.
It's easy to identify them by BOM in the is_buffer_binary() code, but
that's only part of it. We do an OK job of diffing them, except that:

  1. The BOM makes some diffs a little noisier.

  2. We split lines on 0x0a. But this byte can appear in other code
     points, like 0x010a (Ċ), or the entire entire 0x0a* code point (the
     entire Gurmukhi charset).

I'm tempted to detect the UTF-{16,32}{LE,BE} by their BOM, reencode them
to utf8, and then display them in utf8. Is that too gross for us to
consider?

You can kind-of implement this outside of git using textconv. But you
have to manually mark each file as utf-16, as there's no way to trigger
an alternative diff driver on something like a BOM.

I'm really not clear on how people with utf-16 files work. Even if we
did treat utf-16 like text, the _rest_ of git is outputting ascii, so
it's not like their terminals are utf-16. But we do have projects on
github with utf-16 and utf-32 encodings.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html