No, I need a solution that is as generic as possible. I use UTF-8-encoded
Unicode strings at all levels. This is what I have done so far:
1) Writing a separate Python command-line script for testing, which works
as expected:
#!/usr/bin/python
import sys
import unicodedata

# Decode the UTF-8 command-line argument, decompose it (NFKD),
# and drop all combining marks (diacritics).
str = sys.argv[1].decode('UTF-8')
str = unicodedata.normalize('NFKD', str)
str = ''.join(c for c in str if unicodedata.combining(c) == 0)
print str
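For example (saving the script as normalize.py, just an illustrative name,
and making it executable), I would expect:

$ ./normalize.py 'aÄÖÜ'
aAOU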
2) Transferring this to PL/Python:
CREATE OR REPLACE FUNCTION test (str text)
RETURNS text
AS $$
import unicodedata
return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;
Problem: PL/Python throws an error where my command-line script handled it
correctly:
# select test('aÄÖÜ');
ERROR: plpython: function "test" could not create return value
DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)
I use PG 8.3 and Python 2.5.2. How can I make PL/Python behave like a
normal Python environment?
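One workaround I am considering, assuming the problem is that plpythonu
converts the unicode return value with the default ASCII codec, would be to
encode the result back to UTF-8 before returning (the name normalize below
is just the name I want to use later):

CREATE OR REPLACE FUNCTION normalize(str text)
RETURNS text
AS $$
import unicodedata
# The text argument arrives as a UTF-8 byte string (the database uses
# UTF-8), so decode it, normalize to NFKD and strip combining marks.
s = str.decode('UTF-8')
s = unicodedata.normalize('NFKD', s)
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
# Encode back to UTF-8 so the return value is a byte string and
# plpythonu does not fall back to the ASCII codec.
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;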
In the end it should look like this:
CREATE TABLE t (
...
  ts tsvector NOT NULL
);
INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
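And then, assuming the normalize() function sketched above, probably a GIN
index on ts and queries that run the search terms through the same function,
e.g.:

CREATE INDEX t_ts_idx ON t USING gin (ts);

SELECT * FROM t WHERE ts @@ to_tsquery(normalize('Äpfel'));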
Andi
David Fetter wrote:
On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
Has somebody integrated Unicode normalization into Postgres? If not, I
would have to implement my own function using this CPAN module:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
I need a function which removes all diacritics (1) and transforms some
characters to a more compatible form (2) to get a better index on
strings.
Best,
Andi
1) à,ä, ... => a
2) ø => o, ƒ => f, ª => a
You mean something like this?
http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
Cheers,
David.