On Wed, Jul 03 2019, Karsten Blees via GitGitGadget wrote: > From: Karsten Blees <blees@xxxxxxx> > > On native Windows, Git exclusively uses UTF-8 for console output (both > with MinTTY and native Win32 Console). Gettext uses `setlocale()` to > determine the output encoding for translated text, however, MSVCRT's > `setlocale()` does not support UTF-8. As a result, translated text is > encoded in system encoding (as per `GetAPC()`), and non-ASCII chars are > mangled in console output. > > Side note: There is actually a code page for UTF-8: 65001. In practice, > it does not work as expected at least on Windows 7, though, so we cannot > use it in Git. Besides, if we overrode the code page, any process > spawned from Git would inherit that code page (as opposed to the code > page configured for the current user), which would quite possibly break > e.g. diff or merge helpers. So we really cannot override the code page. > > In `init_gettext_charset()`, Git calls gettext's > `bind_textdomain_codeset()` with the character set obtained via > `locale_charset()`; Let's override that latter function to force the > encoding to UTF-8 on native Windows. > > In Git for Windows' SDK, there is a `libcharset.h` and therefore we > define `HAVE_LIBCHARSET_H` in the MINGW-specific section in > `config.mak.uname`, therefore we need to add the override before that > conditionally-compiled code block. > > Rather than simply defining `locale_charset()` to return the string > `"UTF-8"`, though, we are careful not to break `LC_ALL=C`: the > `ab/no-kwset` patch series, for example, needs to have a way to prevent > Git from expecting UTF-8-encoded input. It's not just the ab/no-kwset I have cooking (but happy to have this take that into account), but also anything grep-like is usually must faster with LC_ALL=C. Isn't that also the case on Windows? Setting locales affects a large variety of libc functions and third party libraries (e.g. PCRE via us setting "use UTF-8" under locale). > Signed-off-by: Karsten Blees <blees@xxxxxxx> > Signed-off-by: Johannes Schindelin <johannes.schindelin@xxxxxx> > --- > gettext.c | 20 +++++++++++++++++++- > 1 file changed, 19 insertions(+), 1 deletion(-) > > diff --git a/gettext.c b/gettext.c > index d4021d690c..3f2aca5c3b 100644 > --- a/gettext.c > +++ b/gettext.c > @@ -12,7 +12,25 @@ > #ifndef NO_GETTEXT > # include <locale.h> > # include <libintl.h> > -# ifdef HAVE_LIBCHARSET_H > +# ifdef GIT_WINDOWS_NATIVE > + > +static const char *locale_charset(void) > +{ > + const char *env = getenv("LC_ALL"), *dot; > + > + if (!env || !*env) > + env = getenv("LC_CTYPE"); > + if (!env || !*env) > + env = getenv("LANG"); > + > + if (!env) > + return "UTF-8"; > + > + dot = strchr(env, '.'); > + return !dot ? env : dot + 1; > +} > + > +# elif defined HAVE_LIBCHARSET_H > # include <libcharset.h> > # else > # include <langinfo.h> I'll take it on faith that this is what the locale_charset() should look like. I wonder if it wouldn't be better to always compile this function, and just have init_gettext_charset() switch between the two. We've moved more towards that sort of thing (e.g. with pthreads). I.e. prefer redundant compilation to ifdefing platform-only code (which then only gets compiled there). See "HAVE_THREADS" in the code. It looks to me that with this patch the HAVE_LIBCHARSET_H docs in "Makefile" become wrong. Shouldn't those be updated too? We also still pass -DHAVE_LIBCHARSET_H to every file we compile, only to never use it under GIT_WINDOWS_NATIVE, but perhaps fixing that isn't possible with GIT_WINDOWS_NATIVE being a macro, and perhaps I've again gotten the "native" v.s. "mingw" etc. relationship wrong in my head and the HAVE_LIBCHARSET_H docs are fine. It just seems wrong that we have both the configure script & config.mak.uname look for / declare that we have libcharset.h, only to at this late point not use libcharset.h at all. Couldn't we just know if GIT_WINDOWS_NATIVE will be true earlier & move that check up, so it & HAVE_LIBCHARSET_H can be mutually exclusive (with accompanying #error if we have both)?