Re: Odd encoding issue with UTF-8 + gettext yields ? on non-ASCII

On Sun, Aug 29, 2010 at 20:45, Jonathan Nieder <jrnieder@xxxxxxxxx> wrote:
> Ævar Arnfjörð Bjarmason wrote:
>
>> So, my plan of attack is:
>>
>>  * Add compat/printf from Free, Open or NetBSD. Maybe make
>>    compat/snprintf.c use that while I'm at it.
>
> I would prefer to get this fixed in glibc, but of course that
> has nothing to do with git.

Yeah, but even if it's fixed there, the glibc versions we have to
support won't all be updated for at least ten years.

So even if the bug were fixed upstream today we'd still need a
workaround.

>>  * Use that instead of the GNU libc printf on systems that have glibc.
>>  * Add a configure check for that.
>>  * Revert 107880a
>>  * Get gettext goodness with LC_CTYPE
>>
>> Does anyone see a problem with that? The potential issue is that
>> LC_CTYPE is for:
>>
>>     "regular expression matching,
>
> should be okay, I think (unless http-backend is a problem)

User-level commands that take regexes would have different semantics
based on the locale though, e.g. git log --grep=<regex>.

>> character classification,
>
> worked around (see git-compat-util.h)

Yay sane_istest!

>>     conversion,
>
> I don't know what this means; iconv() is not affected by LC_CTYPE,
> is it?

I think it's only to do with functions like btowc, see:
http://www.gnu.org/s/libc/manual/html_node/Restartable-multibyte-conversion.html#Restartable-multibyte-conversion

>> case-sensitive comparison,
>
> Could be a problem: we use strcasecmp() heavily.

Yeah, strcasecmp is affected by LC_CTYPE.
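
As a rough illustration of what a locale-independent comparison could
look like: the sketch below folds only ASCII letters, so its result does
not change with LC_CTYPE. The helper names ascii_tolower and
ascii_strcasecmp are made up for this example and are not taken from
git's source.

```c
#include <stddef.h>

/* Hypothetical helper: fold only ASCII 'A'..'Z', ignoring the locale. */
static int ascii_tolower(int c)
{
	return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical locale-independent stand-in for strcasecmp(). */
static int ascii_strcasecmp(const char *a, const char *b)
{
	const unsigned char *p = (const unsigned char *)a;
	const unsigned char *q = (const unsigned char *)b;

	for (;; p++, q++) {
		int x = ascii_tolower(*p);
		int y = ascii_tolower(*q);

		if (x != y)
			return x - y;
		if (!x)
			return 0; /* both strings ended at the same point */
	}
}
```

Bytes outside ASCII are compared as-is, so the UTF-8 encodings of "Æ"
and "æ" are treated as different strings in every locale.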

>> and wide character
>>     functions."
>
> no problem. :)

Nope.

>> So it might have unintended side-effects. But the only other
>> workaround I can see is to decree that all consumers of the localized
>> messages must have a UTF-8 locale.
>
> And that is no workaround at all; the problem is still seen with UTF-8
> locales, no?

No, it'll be seen with all non-UTF-8 locales. Here's the issue:

When we add non-ASCII to the po/*.po files we'll write it in UTF-8 as
a matter of policy, simply because that's all the rage these days.

However, unless we put "Content-Type: text/plain; charset=UTF-8\n" in
the file, the gettext utilities won't *know* that it's in UTF-8. Without
that header it's just a raw stream of bytes to them, so they won't do
the right conversion under non-UTF-8 locales.

But users using a gettext translation under a UTF-8 locale won't tell
the difference: the *.po encoding and their expected encoding are the
same, so no conversion is needed anyway.
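
To make the conversion step concrete, here is a small sketch of the
UTF-8 to ISO-8859-1 conversion that an is_IS.iso88591 user would need.
It uses plain iconv(); the helper name utf8_to_latin1 is made up for
this example, and this is not a claim about how gettext does it
internally.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert a UTF-8 buffer to ISO-8859-1.
 * Returns the number of bytes written to out, or -1 on error.
 */
static long utf8_to_latin1(const char *in, size_t inlen,
			   char *out, size_t outlen)
{
	iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
	char *inp = (char *)in;
	char *outp = out;
	size_t inleft = inlen, outleft = outlen;
	size_t rc;

	if (cd == (iconv_t)-1)
		return -1;
	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);
	if (rc == (size_t)-1)
		return -1;
	return (long)(outlen - outleft);
}
```

"Æ" is two bytes in UTF-8 (0xC3 0x86) but one byte in ISO-8859-1
(0xC6), so a terminal expecting ISO-8859-1 shows garbage if the UTF-8
bytes are printed without this step.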

We can even keep the "Content-Type: text/plain; charset=UTF-8\n" and
*not* use LC_CTYPE if we add a bind_textdomain_codeset("git", "UTF-8")
call to gettext.c. That call tells gettext to return messages for the
"git" domain in UTF-8 (so without LC_CTYPE there still won't be any
conversion to the user's locale encoding), see
http://www.gnu.org/s/libc/manual/html_node/Charset-conversion-in-gettext.html#Charset-conversion-in-gettext
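
A minimal sketch of what that setup might look like. The function name
init_gettext_utf8_only and the localedir parameter are assumptions made
for this example; only the "git" domain and the
bind_textdomain_codeset() call come from the text above.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical init helper; returns 0 on success, -1 on failure. */
static int init_gettext_utf8_only(const char *localedir)
{
	/* Select the message language only; LC_CTYPE is left at "C". */
	setlocale(LC_MESSAGES, "");

	if (!bindtextdomain("git", localedir))
		return -1;

	/* Request UTF-8 output for the "git" domain whatever the locale is. */
	bind_textdomain_codeset("git", "UTF-8");

	if (!textdomain("git"))
		return -1;
	return 0;
}
```

With this, gettext() hands back UTF-8 bytes under every locale, which
is fine for is_IS.utf8 and wrong for is_IS.iso88591.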

Here's a table explaining the various approaches:

    A: [correctness] LC_CTYPE + *.po charset=UTF-8
    B: [UTF-8-only hack] no LC_CTYPE + no *.po charset=UTF-8
    C: [UTF-8-only hack] no LC_CTYPE + *.po charset=UTF-8 + bind_textdomain_codeset("git", "UTF-8")

    | Approach | Correct *.po encoding header | GNU printf() issue | LANG=is_IS.utf8 OK | LANG=is_IS.iso88591 OK  |
    |----------+------------------------------+--------------------+--------------------+-------------------------|
    | A        | X                            | X                  | X                  | X                       |
    | B        | No                           | No, no LC_CTYPE    | X                  | No, still outputs UTF-8 |
    | C        | X                            | No, no LC_CTYPE    | X                  | No, still outputs UTF-8 |

A would be preferred for correctness, and with a fallback BSD printf()
we can avoid the GNU libc bug. However, that'll mean using LC_CTYPE,
which'll have some small side-effects for the rest of the code.
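
For comparison, approach A could be sketched roughly as below. The
function name init_gettext_with_ctype and the localedir parameter are
assumptions made for this example. It returns the codeset that the
LC_CTYPE locale selects, which is the encoding gettext converts its
messages to when the *.po file declares its own charset.

```c
#include <langinfo.h>
#include <libintl.h>
#include <locale.h>

/* Hypothetical init helper for approach A. */
static const char *init_gettext_with_ctype(const char *localedir)
{
	/* This is the call with side-effects on the rest of the code. */
	setlocale(LC_CTYPE, "");
	setlocale(LC_MESSAGES, "");

	bindtextdomain("git", localedir);
	textdomain("git");

	/* The locale's encoding, i.e. the target of gettext's conversion. */
	return nl_langinfo(CODESET);
}
```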