Re: mass-removal of LANG=anything-not-C.UTF-8 in packages

Florian Weimer <fweimer@xxxxxxxxxx> · Tue, 06 Nov 2018 15:59:34 +0100

* Panu Matilainen:

> On 11/06/2018 02:13 PM, Mike FABIAN wrote:
>> Panu Matilainen <pmatilai@xxxxxxxxxx> wrote:
>>
>>> On 11/06/2018 12:15 PM, Zbigniew Jędrzejewski-Szmek wrote:
>>>> On Tue, Nov 06, 2018 at 12:10:04PM +0200, Panu Matilainen wrote:
>>>>> On 11/06/2018 03:05 AM, Kevin Kofler wrote:
>>>>>> Zbigniew Jędrzejewski-Szmek wrote:
>>>>>>> The first step is to replace LC_ALL=en_US.UTF-8 with LC_ALL=C.UTF-8
>>>>>>> (and similarly for LANG=, LC_CTYPE=, etc.) in all spec files.
>>>>>>
>>>>>> But there are probably many more packages where the setting is hidden in
>>>>>> upstream build scripts.
>>>>>
>>>>> Build- and various other scripts.
>>>>>
>>>>> Is C.UTF-8 glibc upstream now, or is it still Fedora-specific?
>>>>
>>>> It was never Fedora-specific. The original justification in 2013 or so
>>>> was "other distros already do it". It's just glibc upstream that doesn't
>>>> have it.
>>>>
>>>> We still carry
>>>> https://src.fedoraproject.org/rpms/glibc/blob/master/f/glibc-c-utf8-locale.patch,
>>>> so it seems this hasn't been upstream.
>>>
>>> Ugh, this is a rather cumbersome situation for other projects:
>>> supporting and using C.UTF-8 isn't going to happen large scale until
>>> it's upstreamed. And it does make one wonder what exactly is
>>> preventing it from being upstreamed in glibc.
>>
>> The current C.UTF-8 locale doesn’t sort correctly. It should sort
>> according to code point order, but it does that only partly. It is sort
>> of a quick hack. The glibc developers are working on a better solution
>> but this takes more time.
>>
>
> Hmm. Not sorting correctly doesn't sound so good when LANG=C (and now
> C.UTF-8) is quite commonly used exactly for that purpose.
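For readers unfamiliar with the term, here is a small sketch of what
"code point order" means. It is an illustration only and does not call
into glibc; the word list is made up. Python compares strings code point
by code point, and comparing the UTF-8 encoded bytes gives the same
order, which is the order a strcmp-style byte comparison would produce.

```python
# Made-up sample data: mixed case plus one non-ASCII word.
words = ["zebra", "Zoo", "éclair", "apple"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 encodings byte by byte yields the same order,
# because UTF-8 preserves code point order.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)  # ['Zoo', 'apple', 'zebra', 'éclair']
```

Note that this is not a dictionary order: all uppercase ASCII letters
come before all lowercase ones, and "é" (U+00E9) sorts after "z".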

Not all of this looks fixable to me in the current setting.  We expose the table
layout via nl_langinfo, so that's part of the ABI, and the tables just
cannot express the sorting order with less than three to four bytes per
codepoint.  That's a lot of data even if we restrict ourselves to the
modern UTF-8 range (those codepoints addressable using UTF-16 surrogate
pairs).
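To put a rough number on "a lot of data", the following sketch assumes a
flat table with one fixed-width entry for every code point from U+0000
through U+10FFFF. That layout is an assumption made for the arithmetic,
not a description of the actual glibc table format, and the function
name is hypothetical.

```python
# U+0000 through U+10FFFF, i.e. the range addressable with UTF-16.
CODE_POINTS = 0x10FFFF + 1


def flat_table_bytes(bytes_per_code_point):
    """Size of a flat table holding one fixed-width entry per code point."""
    return CODE_POINTS * bytes_per_code_point


for width in (3, 4):
    size = flat_table_bytes(width)
    print(f"{width} bytes per code point: {size} bytes ({size / 2**20:.2f} MiB)")
```

Under that assumption the table comes to roughly 3.2 MiB at three bytes
per code point and 4.25 MiB at four.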

I think we could generate the tables on the fly if they are ever
requested using nl_langinfo.  Not many applications seem to do that.
Internally within glibc, we could use a different interface to avoid the
table generation.
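The idea can be modelled roughly as follows. This is a toy sketch, not
glibc code: the class, its attribute names, and the identity weight
table are all hypothetical stand-ins. The table is only built the first
time something asks for it (standing in for an nl_langinfo caller),
while the internal comparison never touches it.

```python
from array import array
from functools import cached_property

CODE_POINTS = 0x110000  # U+0000 through U+10FFFF


class CodePointCollation:
    """Toy model of a code-point-order locale (hypothetical, not glibc)."""

    def __init__(self):
        self.table_builds = 0  # counts how often the table was generated

    @cached_property
    def weight_table(self):
        # Stand-in for the table an nl_langinfo caller would see.
        # Built on first access only, then cached on the instance.
        self.table_builds += 1
        return array("L", range(CODE_POINTS))

    def compare(self, a, b):
        # Internal path: code point order needs no table at all,
        # since Python compares str values code point by code point.
        return (a > b) - (a < b)


locale_model = CodePointCollation()
print(locale_model.compare("a", "b"))   # internal comparison, no table built
print(locale_model.table_builds)        # still 0 at this point
print(len(locale_model.weight_table))   # first access generates the table
```

Callers that only compare strings never pay for the table; only the
(apparently rare) caller that requests it triggers the one-time build.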

The table layout also has significant problems with expressing proper
collation tables.  We need to investigate this more deeply, but my
impression is that the collation and collation sequence tables
constitute a significant fraction of the locale data on disk.  Changing
the table layout again has ABI implications there, similar to those for
C.UTF-8, except that the on-the-fly conversion code will be more
difficult to write.

Thanks,
Florian
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx