Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII

Marcin Cieslak <saper@xxxxxxxxxx> · Mon, 30 Aug 2010 14:00:11 +0000

On Mon, 30 Aug 2010, Ævar Arnfjörð Bjarmason wrote:

On Sun, Aug 29, 2010 at 20:45, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
A would be preferred for correctness, and with a fallback BSD printf()
we can avoid the GNU libc bug, however that'll mean using LC_CTYPE,
which'll have some small side-effects for the rest of the code.

The real problem is that you are probably using the same
(locale-enabled) functions for the user-facing side as well as for
the backend (talking to the repository). Some projects decided to use
a special encoding internally (like UCS-2 in the case of Java
and Python 2.x, Unicode ordinals in Python 3.x). Otherwise
you may end up with incompatibilities in the on-disk or
on-network format. I don't think you want to keep telling all bug
reporters for a few years - "Can you try that again with env LANG=C,
please?" :)

Bringing Unicode on board means that a simple strlen() no longer
does what you normally think it does: it counts bytes, not characters.
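
To make the strlen() point concrete, here is a minimal sketch in C. The
helper name utf8_codepoints() is made up for illustration and it assumes
the input is already valid UTF-8; it is not something from the git tree.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count code points in a UTF-8 string by skipping
 * continuation bytes (those of the form 10xxxxxx).
 * Assumes the input is valid, NUL-terminated UTF-8.
 */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the UTF-8 bytes of "Ævar" (0xC3 0x86 followed by "var"), strlen()
reports 5 while the helper reports 4. Even that is only code points, not
what ends up as columns on screen.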

On Mon, 30 Aug 2010, Jonathan Nieder wrote:

Ævar Arnfjörð Bjarmason wrote:

We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
*not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
call to gettext.

Oh!  I'd personally prefer to do that for now. :)  (Not because of the
known printf problem but because I like to reduce possible unknowns.)

Well, in this case everybody will be forced to have UTF-8 output
on screen, which is not useful for people using ISO8859-*, KOI8-R
and similar things...
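
For reference, the setup being proposed would look roughly like the
sketch below. The function name and the catalog directory are
assumptions for illustration only; the gettext calls themselves are the
standard libintl ones.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical init helper; the catalog directory is an assumed path. */
int init_gettext_utf8(void)
{
    /*
     * Pick the message language from the environment, but do not touch
     * LC_CTYPE, so printf() and friends keep running in the "C" locale.
     */
    setlocale(LC_MESSAGES, "");

    if (!bindtextdomain("git", "/usr/share/locale"))
        return -1;

    /*
     * Ask gettext to return UTF-8 regardless of the user's charset.
     * This is the call that forces UTF-8 onto ISO8859-* or KOI8-R
     * terminals.
     */
    bind_textdomain_codeset("git", "UTF-8");
    textdomain("git");
    return 0;
}
```

With this, gettext("...") hands back UTF-8 bytes no matter what LANG or
LC_ALL say about the charset, which is exactly the trade-off I am
complaining about.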

--Marcin