RE: git-rebase is ignoring working-tree-encoding

"Alexandre Grigoriev" <alegrigoriev@xxxxxxxxx> · Tue, 25 Dec 2018 16:56:11 -0800

> -----Original Message-----
> From: git-owner@xxxxxxxxxxxxxxx [mailto:git-owner@xxxxxxxxxxxxxxx] On
> Behalf Of Torsten Bogershausen
> Sent: Thursday, November 8, 2018 9:03 AM
> To: Adrián Gimeno Balaguer
> Cc: git@xxxxxxxxxxxxxxx
> Subject: Re: git-rebase is ignoring working-tree-encoding
> 
> On Wed, Nov 07, 2018 at 05:38:18AM +0100, Adrián Gimeno Balaguer wrote:
> > Hello Torsten,
> >
> > Thanks for answering.
> >
> > To answer your question: I removed the comments mentioning "rebase"
> > because the encoding issue I reported also happens with simpler
> > operations (described in the PR) and is not directly related to
> > rebasing, so I thought it better to avoid unrelated confusion.
> >

> OK, I think I understand your problem now.
> The file format which you ask for could be named "UTF-16-BOM-LE",
> but that does not exist in reality.
> If you use UTF-16, then there must be a BOM, and if there is a BOM,
> then a Unicode-aware application -should- be able to handle it.
> 
> Why does your project require such a format ?
> 

Many tools on Windows still do not understand UTF-8, although it's getting
better. I think Windows is about the only OS where tools still require
UTF-16 for full internationalization.
Many tools written in C use the MSVC runtime library, whose fopen(),
unfortunately, doesn't understand UTF-16BE (though even a program as
rudimentary as Notepad does).

For this reason, it's very reasonable to ask that programming tools
produce UTF-16 files with a particular endianness: the one natural for the
platform they're running on.

The iconv programmers' boneheaded decision to always produce UTF-16BE with
a BOM for UTF-16 output doesn't make sense.
Again, git and iconv/libiconv on CentOS on x86 do the right thing and
produce UTF-16LE with a BOM in this case.
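
To make the "particular endianness plus BOM" output asked for above
concrete, here is a minimal sketch in Python. Python is used only for
illustration and is not part of git or iconv, and the helper name
encode_utf16le_with_bom is hypothetical. It writes the BOM explicitly and
then encodes with the BOM-less little-endian codec, so the result does not
depend on which byte order a generic "UTF-16" encoder happens to pick.

```python
import codecs


def encode_utf16le_with_bom(text: str) -> bytes:
    """Hypothetical helper: UTF-16LE payload preceded by an explicit BOM."""
    # "utf-16-le" never emits a BOM on its own, so prepend the
    # little-endian BOM bytes (FF FE) by hand.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


data = encode_utf16le_with_bom("Hi")
print(data.hex())  # fffe48006900

# A generic UTF-16 decoder reads the BOM, picks little-endian, and drops it.
print(data.decode("utf-16"))  # Hi
```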

Also, iconv/libiconv should not reject files that contain a BOM when the
input encoding is given as UTF-16BE or UTF-16LE.
The BOM is not some magic tag. It's just a zero-width no-break space
(U+FEFF), with the unique property that its 8-bit and 16-bit encoded forms
can be told apart from one another. It can appear anywhere in a file.
If it's the first character in the file, then the file encoding can be
reliably detected. But it's just a character, and iconv should accept
such files as valid.
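
A small sketch of that point, again in Python purely for illustration
(sniff_encoding and decode_lenient are hypothetical names, not anything
from git or iconv): U+FEFF encodes to a different byte sequence in UTF-8,
UTF-16LE and UTF-16BE, which is what makes detection from the first bytes
possible, and a decoder for an explicit-endianness encoding can treat a
leading U+FEFF as an ordinary character instead of rejecting the input.

```python
import codecs
from typing import Optional

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE


def sniff_encoding(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading BOM; None if there is none.

    Only UTF-8 and UTF-16 are considered. UTF-32 is deliberately ignored
    in this sketch (its little-endian BOM also starts with FF FE).
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None


def decode_lenient(data: bytes, encoding: str) -> str:
    """Decode with an explicit encoding, tolerating a leading BOM."""
    text = data.decode(encoding)
    # With "utf-16-le" / "utf-16-be" the BOM comes through as a plain
    # U+FEFF character; drop it only when it is the first character.
    return text[1:] if text.startswith(BOM) else text


sample = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(sniff_encoding(sample))               # utf-16-le
print(decode_lenient(sample, "utf-16-le"))  # Hi
```

A U+FEFF in the middle of the text survives the round trip untouched,
since only a leading one is stripped.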