No, I need a solution that is as generic as possible. I use UTF-8-encoded
Unicode strings at all levels. This is what I have done so far:
1) Writing a separate Python command-line script for testing, which works
as expected:
#!/usr/bin/python
import sys
import unicodedata

# Decode the UTF-8 command-line argument, decompose it (NFKD),
# and drop all combining marks (diacritics).
str = sys.argv[1].decode('UTF-8')
str = unicodedata.normalize('NFKD', str)
str = ''.join(c for c in str if unicodedata.combining(c) == 0)
print str
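For example (saving the script as normalize.py, just an illustrative name,
and making it executable), I would expect:

$ ./normalize.py 'aÄÖÜ'
aAOU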
2) Transferring this to PL/Python:
CREATE OR REPLACE FUNCTION test (str text)
RETURNS text
AS $$
import unicodedata
return unicodedata.normalize('NFKD', str.decode('UTF-8'))
$$ LANGUAGE plpythonu;
Problem: PL/Python throws an error where my command-line script handled it
correctly:
# select test('aÄÖÜ');
ERROR: plpython: function "test" could not create return value
DETAIL: <type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't
encode character u'\u0308' in position 2: ordinal not in range(128)
I use PG 8.3 and Python 2.5.2. How can I make PL/Python behave like a
normal Python environment?
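One workaround I am considering, assuming the problem is that plpythonu
converts the unicode return value with the default ASCII codec, would be to
encode the result back to UTF-8 before returning (the name normalize below
is just the name I want to use later):

CREATE OR REPLACE FUNCTION normalize(str text)
RETURNS text
AS $$
import unicodedata
# The text argument arrives as a UTF-8 byte string (the database uses
# UTF-8), so decode it, normalize to NFKD and strip combining marks.
s = str.decode('UTF-8')
s = unicodedata.normalize('NFKD', s)
s = ''.join(c for c in s if unicodedata.combining(c) == 0)
# Encode back to UTF-8 so the return value is a byte string and
# plpythonu does not fall back to the ASCII codec.
return s.encode('UTF-8')
$$ LANGUAGE plpythonu;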
In the end it should look like this:
CREATE TABLE t (
...
  ts tsvector NOT NULL
);
INSERT INTO t (ts) VALUES(to_tsvector(normalize(?)));
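And then, assuming the normalize() function sketched above, probably a GIN
index on ts and queries that run the search terms through the same function,
e.g.:

CREATE INDEX t_ts_idx ON t USING gin (ts);

SELECT * FROM t WHERE ts @@ to_tsquery(normalize('Äpfel'));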
Andi
David Fetter wrote:
On Wed, Sep 16, 2009 at 07:20:21PM +0200, Andreas Kalsch wrote:
Has somebody integrated Unicode normalization into Postgres? If not, I
would have to implement my own function using this CPAN module:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.03/ .
I need a function which removes all diacritics (1) and transforms some
characters to a more compatible form (2) to get a better index on
strings.
Best,
Andi
1) à,ä, ... => a
2) ø => o, ƒ => f, ª => a
You mean something like this?
http://wiki.postgresql.org/wiki/Strip_accents_from_strings%2C_and_output_in_lowercase
Cheers,
David.