Re: Concerning about Unicode-aware string handling

"Albe Laurenz" <laurenz.albe@xxxxxxxxxx> · Mon, 21 May 2012 15:38:03 +0200

Vincas Dargis wrote:
> We have problems (currently using 8.4, but also in latest 9.1.3) in
> our application with Unicode word symbols in Lithuanian ('ąčęėįšųūž'),
> Russian and of course potentially other languages.
> 
> For example, regexp_replace('acząčž', E'\\W', '', 'g') removes ąčž.
> 
> lower() and ~* comparisons work only with the locale that is set (no
> internationalization).
> 
> Could we expect Unicode support in the near future? Or should we do quick
> hacks by reimplementing regexp_replace(), lower(), upper() and other
> string SQL functions using, for example, Qt libraries? Or are there
> perhaps some simpler workarounds?

I tried it with 9.1.3 on Linux:

upper() and lower() work fine, no matter what the
database encoding is:

test=> SELECT upper('acząčž');
 upper
--------
 ACZĄČŽ
(1 row)

And this seems OK with LATIN7:

lt2=> SHOW server_encoding;
 server_encoding
-----------------
 LATIN7
(1 row)

lt2=> SHOW lc_ctype;
 lc_ctype
----------
 lt_LT
(1 row)

lt2=> SHOW lc_collate;
 lc_collate
------------
 lt_LT
(1 row)

lt2=> SELECT 'ą' ~* '\w';
 ?column?
----------
 t
(1 row)

But it looks wrong with UTF8:

lt=> SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

lt=> SHOW lc_ctype;
  lc_ctype
------------
 lt_LT.utf8
(1 row)

lt=> SHOW lc_collate;
 lc_collate
------------
 lt_LT.utf8
(1 row)

lt=> SELECT 'ą' ~* '\w';
 ?column?
----------
 f
(1 row)

Is that what you are complaining about?
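
For comparison, here is a minimal sketch outside the database of what a
Unicode-aware \w would be expected to do with the sample string. It uses
Python's re module, not PostgreSQL, and the function names are made up for
illustration. The ASCII-only variant mirrors the UTF8 result above, where
the Lithuanian letters are not treated as word characters.

```python
import re

# "acząčž" written with escapes so the sketch does not depend on file encoding.
SAMPLE = "acz\u0105\u010d\u017e"


def strip_non_word_unicode(text: str) -> str:
    """Remove characters that are not word characters under Unicode rules."""
    # For str patterns in Python 3, \W is Unicode-aware by default.
    return re.sub(r"\W", "", text)


def strip_non_word_ascii(text: str) -> str:
    """Remove characters that are not ASCII word characters ([a-zA-Z0-9_])."""
    # re.ASCII restricts \w and \W to ASCII-only classification.
    return re.sub(r"\W", "", text, flags=re.ASCII)


if __name__ == "__main__":
    print(strip_non_word_unicode(SAMPLE))  # Lithuanian letters are kept
    print(strip_non_word_ascii(SAMPLE))    # Lithuanian letters are dropped
```

If the database behaved like the first function, the regexp_replace() call
from the original report would leave the string unchanged.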

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)