On 26.11.21 08:37, Jakub Jedelsky wrote:
postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
postgres-# ;
?column? | ?column?
----------+----------
t | f
(1 row)
postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
?column? | ?column?
----------+----------
f | t
(1 row)
If I could start, I think both results are wrong as both should return
True. If I got it right, in the background there is a lower() function
running to compare strings, which is not enough for such cases (until
the left side isn't taken as a standalone word).
The reason for these results is that for multibyte encodings, a ILIKE b
basically does lower(a) LIKE lower(b), and
select lower('ΣΣ' COLLATE "en_US"), lower('ΣΣ' COLLATE "en-US-x-icu");
lower | lower
-------+-------
σσ | σς
Running lower() like this is really the wrong thing to do. We should be
doing "case folding" instead, which normalizes these differences for the
purpose of case-insensitive comparisons.