Re: case insensitive collation of Greek's sigma

Jakub Jedelsky <jakub.jedelsky@xxxxxxxxxxxx> · Thu, 2 Dec 2021 14:26:39 +0100

On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Peter Eisentraut <peter.eisentraut@xxxxxxxxxxxxxxxx> writes:

> Running lower() like this is really the wrong thing to do.  We should be
> doing "case folding" instead, which normalizes these differences for the
> purpose of case-insensitive comparisons.

That just begs the question: if tolower (or towlower) isn't the
appropriate API, what is?  Perhaps ICU has something for a more
generalized notion of case-similarity, but I'm not aware of any such
thing in the POSIX API.

BTW, I think it's only accidental that the regex example shown upthread
gets the right answer.  In that example, what's happening is that we
consider a letter in a case-insensitive regex to match itself, or
tolower() of itself, or toupper() of itself.  Both σ and ς have Σ
as toupper() so they both work.  But if you'd written Σ in the regex,
only one of σ and ς would match that as a data character.  (Haven't
actually tested this, but given the way the code works I'm pretty
sure it's so.)  Again, it's hard to see how to do better atop a POSIX
locale library.
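
To make the rule described above concrete, here is a rough sketch in Python rather than the actual regex code. It assumes Python's Unicode-aware str.lower() and str.upper() behave like towlower()/towupper() for these three characters; ci_char_matches is a hypothetical helper name, not anything from the PostgreSQL source.

```python
# Sketch only: str.lower()/str.upper() stand in for towlower()/towupper().
# This models the rule as described in the mail, not PostgreSQL's regex engine.

def ci_char_matches(pattern_ch: str, data_ch: str) -> bool:
    """Hypothetical helper: a pattern letter matches itself,
    its lowercase form, or its uppercase form."""
    candidates = {pattern_ch, pattern_ch.lower(), pattern_ch.upper()}
    return data_ch in candidates

# Try every sigma variant as the pattern letter against every variant as data.
for pattern_ch in "σςΣ":
    matched = [d for d in "σςΣ" if ci_char_matches(pattern_ch, d)]
    print(pattern_ch, "matches", matched)
```

Under that assumption, either lowercase sigma in the pattern reaches Σ through upper(), while Σ in the pattern reaches only the single form that lower() returns, so the other lowercase form is missed.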

Thanks for digging into the issue.

According to the GNU libunistring docs [1], the POSIX APIs are not ready for that. Still, would it be possible to keep the current lower()-based behaviour as a fallback for POSIX and use the correct solution, case folding, with ICU? I think (I'm not an expert, though) ICU should have had working case-folding code for some time already.

[1] https://www.gnu.org/software/libunistring/
"""
Text files are nowadays usually encoded in Unicode, and may consist of very different scripts – from Latin letters to Chinese Hanzi –, with many kinds of special characters – accents, right-to-left writing marks, hyphens, Roman numbers, and much more. But the POSIX platform APIs for text do not contain adequate functions for dealing with particular properties of many Unicode characters. In fact, the POSIX APIs for text have several assumptions at their base which don't hold for Unicode text.
"""
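
As a rough illustration of the difference between lowercasing and case folding for sigma, here is a small Python sketch. Python's str.casefold() applies Unicode case folding; ICU exposes a comparable operation (u_strFoldCase), but the sketch does not call ICU, and the two helper names are made up for the example.

```python
# Sketch only: compares a lower()-based comparison with a casefold()-based one.
# Helper names are hypothetical; nothing here is PostgreSQL or ICU code.

def equal_via_lower(a: str, b: str) -> bool:
    """Case-insensitive comparison by lowercasing both sides."""
    return a.lower() == b.lower()

def equal_via_casefold(a: str, b: str) -> bool:
    """Case-insensitive comparison by Unicode case folding both sides."""
    return a.casefold() == b.casefold()

pairs = [
    ("σ", "ς"),        # medial vs. final lowercase sigma
    ("ΟΔΟΣ", "οδος"),  # uppercase word vs. lowercase with final sigma
    ("ΟΔΟΣ", "οδοσ"),  # uppercase word vs. lowercase with medial sigma
]

for a, b in pairs:
    print(a, b, "lower:", equal_via_lower(a, b), "casefold:", equal_via_casefold(a, b))
```

Case folding maps both σ and ς to the same character, so the folded comparison treats all the sigma variants as equal, whereas lowercasing leaves σ and ς distinct.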