Re: git-rebase is ignoring working-tree-encoding

Torsten Bögershausen <tboegi@xxxxxx> · Thu, 27 Dec 2018 15:45:25 +0100

On Wed, Dec 26, 2018 at 06:52:56PM -0800, Alexandre Grigoriev wrote:
> 
> > -----Original Message-----
> > From: brian m. carlson [mailto:sandals@xxxxxxxxxxxxxxxxxxxx]
> > Sent: Wednesday, December 26, 2018 11:25 AM
> > To: Alexandre Grigoriev
> > Cc: 'Torsten Bögershausen'; 'Adrián Gimeno Balaguer'; git@xxxxxxxxxxxxxxx
> > Subject: Re: git-rebase is ignoring working-tree-encoding
> > 
> > On Tue, Dec 25, 2018 at 04:56:11PM -0800, Alexandre Grigoriev wrote:
> > > Many tools in Windows still do not understand UTF-8, although it's
> > > getting better. I think Windows is about the only OS where tools still
> > > require UTF-16 for full internationalization.
> > > Many tools written in C use MSVC RTL, where fopen(), unfortunately,
> > > doesn't understand UTF-16BE (though such a rudimentary program as
> > > Notepad does).
> > >
> > > For this reason, it's very reasonable to ask that the programming
> > > tools produce UTF-16 files with particular endianness, natural for the
> > > platform they're running on.
> > >
> > > The iconv programmers' boneheaded decision to always produce UTF-16BE
> > > with BOM for UTF-16 output doesn't make sense.
> > > Again, git and iconv/libiconv in CentOS on x86 do the right thing and
> > > produce UTF-16LE with BOM in this case.
> > 
> > A program which claims to support "UTF-16" must support both
> > endiannesses, according to RFC 2781. A program writing UTF-16-LE must not
> > write a BOM at the beginning. I realize this is inconvenient, but the bad
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
> 
> OK, we have a choice, either:
> a) to live in that corner of the real world where you have to use the available tools, some of which,
> for historical reasons, only support UTF-16LE with BOM, because nobody ever throws a different flavor of UTF-16 at them;
> or b) to live in an ivory tower where you don't really need to use UTF-16 LE or BE or any other flavor,
> because everything is just UTF-8, and tell all those other people using that lame OS to shut up and wait until their tools start to support
> the formats you don't really have to care about.
> 
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
> 
> Yes, Git (actually libiconv) should not ignore interoperability.
> This means it should check out files on a *Windows* system in a format which *Windows* tools
> can understand.
> And, by the way, CentOS (or Red Hat?) developers understood that.
> There, on an x86 installation, when you ask for UTF-16, it produces UTF-16LE with BOM.
> Just as every user there would want.
> 
> 

Sorry if I'm confused here: does the problem still exist?
If so, does the following patch help?
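
As a rough standalone sketch of the idea (this is not Git code, and
`force_utf16_le` is a made-up name used only for illustration): if a
converted buffer starts with the UTF-16BE BOM, swap the two bytes of every
16-bit unit in place, so that the result is UTF-16LE with BOM. It assumes
the buffer really holds UTF-16 and ignores a trailing odd byte. The actual
change to utf8.c is in the diff further down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-16 big-endian byte order mark */
static const unsigned char be_bom[] = { 0xFE, 0xFF };

/*
 * Hypothetical helper, not part of Git.
 * Returns 1 if the buffer started with a UTF-16BE BOM and was swapped
 * to little endian in place, 0 if it was left untouched.
 */
static int force_utf16_le(unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL);
	if (len < sizeof(be_bom) || memcmp(buf, be_bom, sizeof(be_bom)))
		return 0;	/* no UTF-16BE BOM: leave the buffer alone */

	/* Swap the two bytes of every 16-bit unit, the BOM included. */
	for (i = 0; i + 1 < len; i += 2) {
		unsigned char tmp = buf[i];
		buf[i] = buf[i + 1];
		buf[i + 1] = tmp;
	}
	return 1;
}
```

Swapping bytes rather than whole `uint16_t` values avoids any alignment
requirement on the buffer; that is a choice made for this sketch only.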

diff --git a/utf8.c b/utf8.c
index eb78587504..2facef84d4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -9,6 +9,23 @@ struct interval {
 	ucs_char_t last;
 };
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
+static inline uint16_t default_swab16(uint16_t val)
+{
+	return (((val & 0xff00) >>  8) |
+		((val & 0x00ff) <<  8));
+}
+
 size_t display_mode_esc_sequence_len(const char *s)
 {
 	const char *p = s;
@@ -556,21 +573,19 @@ char *reencode_string_len(const char *in, size_t insz,
 
 	out = reencode_string_iconv(in, insz, conv, outsz);
 	iconv_close(conv);
+	if (has_bom_prefix(out, *outsz, utf16_be_bom, sizeof(utf16_be_bom))) {
+		/* UTF-16 should be little endian under Git */
+		size_t    num_points = *outsz / sizeof(uint16_t);
+		uint16_t *point = (uint16_t*) out;
+		while (num_points--) {
+			*point = default_swab16(*point);
+			point++;
+		}
+	}
 	return out;
 }
 #endif
 
-static int has_bom_prefix(const char *data, size_t len,
-			  const char *bom, size_t bom_len)
-{
-	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
-}
-
-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (