Re: [PATCH v3 5/5] fast-export: do automatic reencoding of commit messages only if requested

Torsten Bögershausen <tboegi@xxxxxx> · Mon, 13 May 2019 14:56:35 +0200

On Mon, May 13, 2019 at 12:23:29PM +0200, Johannes Schindelin wrote:
> Hi Elijah,
>
> On Sat, 11 May 2019, Elijah Newren wrote:
>
> > [...] the craziness is based on how Windows behaves; it seems insane to
> > me that Windows decides to munge user data (in the form of the command
> > line provided), so much so that it makes me wonder if I really
> > understood Hannes' and Dscho's explanations of what it is doing.
>
> It is not the user data that is munged by *Windows*, but by *Git for
> Windows*. The user data on Windows is encoded in UTF-16 (or some slight
> variant thereof). Git *cannot* handle UTF-16. Git's test suite *cannot*
> handle UTF-16. So we convert. That's all there is to it.
>
> Ciao,
> Dscho
>
> P.S.: Of course it is not *all* there is to it. There is also a current
> code page which depends on the current user's current locale. We can
> definitely not rely on that, as Git has no idea about this and would quite
> positively produce incorrect output because of it. So we really just use
> the `*W()` functions of the Win32 API (i.e. the ones accepting wide
> Unicode characters and strings, i.e. UTF-16). I don't think we can do
> better than that.

We can actually feed valid UTF-8 into a test case.
(Remember that shell scripts need to spell the bytes as octal escapes, see
t/t0050)

Take the "ä" code point (U+00E4), which is the two bytes 0xC3 0xA4 in UTF-8:
$ auml=$(printf '\303\244')
$ printf '%s\n' "$auml"
ä

Now we can feed those 2 bytes (which are valid UTF-8) into
Git and say "convert them from ISO-8859-1 into UTF-8",
resulting in 4 bytes.
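
To make the byte counts concrete, here is a small sketch. It assumes
iconv(1) is available and uses it only as a stand-in for the re-encoding
step; it is not what Git or the test suite actually runs, and the variable
names are just for illustration:

```shell
# The two bytes of "ä" in UTF-8, written as octal escapes.
auml=$(printf '\303\244')

# Pretend those bytes are ISO-8859-1 and convert them to UTF-8.
# (iconv is an assumption here, standing in for Git's re-encoding.)
converted=$(printf '%s' "$auml" | iconv -f ISO-8859-1 -t UTF-8)

# Byte counts before and after; tr strips padding some wc versions add.
before=$(printf '%s' "$auml" | wc -c | tr -d ' ')
after=$(printf '%s' "$converted" | wc -c | tr -d ' ')
echo "before=$before after=$after"

# Octal dump of the converted bytes.
printf '%s' "$converted" | od -An -to1
```

Each input byte is above 0x7F, so each one becomes a 2-byte UTF-8
sequence: \303 turns into \303\203 and \244 turns into \302\244.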
Is my explanation clear enough?
If not, please tell me.