On Tue, Oct 25, 2011 at 07:07:38AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
> > I mean, besides the obvious that UTF-16 is ...
>
> Yes, you could, besides the obvious. But that obvious reason makes it
> sufficiently different that it may not be so outrageous to draw the line
> between it and all the others.

Yeah, and I'm OK with that. It's just not a satisfying answer to give
Windows people who think UTF-16 is a good idea. But at the very least,
it's still Unicode. It should be lossless for them to convert to utf8
and back if they want.

Speaking of which, I've been looking at handling diffing of utf-16
files. Right now we generally just consider them binary, which sucks.
It's easy to identify them by BOM in the is_buffer_binary() code, but
that's only part of it. Once they are treated as text, we do an OK job
of diffing them, except that:

  1. The BOM makes some diffs a little noisier.

  2. We split lines on 0x0a. But this byte can appear in other code
     points, like 0x010a (Ċ), or the entire 0x0a* range of code points
     (the entire Gurmukhi charset).

I'm tempted to detect UTF-{16,32}{LE,BE} by their BOM, reencode them
to utf8, and then display them in utf8. Is that too gross for us to
consider?

You can kind-of implement this outside of git using textconv. But you
have to manually mark each file as utf-16, as there's no way to trigger
an alternative diff driver on something like a BOM.

I'm really not clear on how people with utf-16 files work. Even if we
did treat utf-16 like text, the _rest_ of git is outputting ascii, so
it's not like their terminals are utf-16. But we do have projects on
github with utf-16 and utf-32 encodings.

-Peff
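
To make the BOM idea concrete, here is a rough sketch in Python rather than git's C. It is only an illustration of the approach described above, not a patch; the helper names detect_bom() and to_utf8_for_diff() are made up for this example and do not exist in git.

```python
import codecs

# BOMs are checked longest-first: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(buf):
    """Return (encoding, bom_length), or (None, 0) if no UTF-16/32 BOM."""
    for bom, encoding in BOMS:
        if buf.startswith(bom):
            return encoding, len(bom)
    return None, 0


def to_utf8_for_diff(buf):
    """Hypothetical helper: re-encode BOM-marked UTF-16/32 data as UTF-8.

    Buffers without a recognised BOM are returned unchanged.
    """
    encoding, bom_len = detect_bom(buf)
    if encoding is None:
        return buf
    # Drop the BOM so it does not add noise to the diff.
    return buf[bom_len:].decode(encoding).encode("utf-8")


# U+010A (Ċ) followed by a newline, as UTF-16LE with a BOM.
sample = codecs.BOM_UTF16_LE + "\u010a\n".encode("utf-16-le")

# Splitting the raw bytes on 0x0a cuts through the middle of U+010A.
raw_pieces = sample.split(b"\n")

# After conversion the only 0x0a byte left is the real line ending.
converted = to_utf8_for_diff(sample)
```

The sample is one character plus one newline, yet the raw bytes split into three pieces on 0x0a, which is problem 2 from the list above. One caveat with any BOM sniffing of this kind: a UTF-16LE file whose first character is U+0000 starts with the same four bytes as a UTF-32LE BOM, so the detection is a heuristic.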