How to simplify unicode strings

Andreas Kalsch <andreaskalsch@xxxxxx> · Thu, 17 Sep 2009 01:37:47 +0200

Thank you Sam,

this led to the correct solution:

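CREATE OR REPLACE FUNCTION simplify (str text)
RETURNS text
AS $$
import unicodedata

# The argument arrives as a byte string (assumed UTF-8 here), so decode it first.
# NFKD splits each accented character into a base character plus combining marks.
s = unicodedata.normalize('NFKD', str.decode('UTF-8'))
# Drop the combining marks (combining class != 0) and keep the base characters.
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;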

test=# select simplify('Français va à Paris, () {} [] µ @ º Ångstrøm 
Phiat-im hû-hō sī phiat tī 1-ê ki-chhó· jī-bó bīn-téng ê hû-hō. Siōng 
phó·-phiàn ê kong-lêng sī kái-piàn ki-chhó· jī-bó ê hoat-im.');
simplify
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Francais va a Paris, () {} [] μ @ o Angstrøm Phiat-im hu-ho si phiat ti 
1-e ki-chho· ji-bo bin-teng e hu-ho. Siong pho·-phian e kong-leng si 
kai-pian ki-chho· ji-bo e hoat-im.
(1 row)
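
For anyone who wants to try the same logic outside the database, here is a minimal sketch in plain Python 3. It is not the PL/Python function above: in Python 3 a str is already Unicode, so the decode/encode steps drop out, and the name simplify_text is just a placeholder chosen for this example.

```python
import unicodedata


def simplify_text(text):
    """Strip combining marks after NFKD decomposition (hypothetical helper name)."""
    # Split accented characters into base character + combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only characters whose canonical combining class is 0.
    return ''.join(c for c in decomposed if unicodedata.combining(c) == 0)


if __name__ == '__main__':
    print(simplify_text('Fran\u00e7ais va \u00e0 Paris'))
```

Characters that have no decomposition, such as the ø in Ångstrøm, pass through unchanged, which matches the output shown above.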

One question remains: how well does PL/Python perform?
When there are syntax errors in the Python code, they are not reported 
on CREATE, apparently because the function is recompiled on every call.
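
If syntax errors only show up at call time, one workaround is to check the function body on the client before sending it. The sketch below is only an assumption about how that could be done: it wraps the body in a def (a bare return would otherwise be rejected at module level) and hands it to Python's built-in compile(). The helper name body_parses is made up for this example, and the check is only meaningful when run under the same Python version the server uses.

```python
import textwrap


def body_parses(body, arg_names=()):
    """Return True if a function body is syntactically valid Python.

    Hypothetical client-side check; it does not talk to PostgreSQL at all.
    """
    # Wrap the body in a function so that a top-level 'return' is legal.
    source = "def _wrapper(%s):\n%s\n" % (
        ", ".join(arg_names),
        textwrap.indent(body, "    "),
    )
    try:
        # compile() only parses and compiles; nothing is executed.
        compile(source, "<function body>", "exec")
    except SyntaxError:
        return False
    return True
```

This catches syntax errors only. Name errors, missing modules and the like would still appear on the first real call.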

This leads to the next question: when will the Unicode stuff be included in 
the main distribution?

Andi

Sam Mason wrote:
On Wed, Sep 16, 2009 at 09:35:02PM +0200, Andreas Kalsch wrote:

CREATE OR REPLACE FUNCTION test (str text)
 RETURNS text
AS $$
   import unicodedata
   return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;

I'd guess you want that to be:

  return unicodedata.normalize('NFKD', str.decode('UTF-8')).encode('UTF-8')

If you're converting from a utf8 encoding, you probably need to go
back again!  This could certainly be made easier, though: PG knows what
encoding its strings are stored in, so why doesn't it work with unicode
strings by default?
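
The round trip described here can be sketched in isolation. This assumes the input is a UTF-8 byte string, which is how the functions in this thread treat their argument. The helper name nfkd_utf8 is hypothetical, and the code is written for Python 3, where bytes and str are distinct types.

```python
import unicodedata


def nfkd_utf8(raw):
    """Normalize UTF-8 bytes to NFKD and return UTF-8 bytes (hypothetical helper)."""
    # bytes -> str: normalize() only accepts Unicode strings.
    text = raw.decode('UTF-8')
    normalized = unicodedata.normalize('NFKD', text)
    # str -> bytes: go back to the encoding the caller handed in.
    return normalized.encode('UTF-8')
```

Skipping the final encode would hand a Unicode string back to a caller that expects bytes, which is the mismatch pointed out above.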

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)