German ispell dictionary: error parsing affix file

Hi,

I'm trying to get a German ispell dictionary that supports
compound words to work with PostgreSQL 8.3.7. I tried
the following three dictionaries:

- http://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries/de_DE_frami.zip
(for OpenOffice 2),
- http://extensions.services.openoffice.org/project/dict-de_DE_frami
(for OpenOffice 3) and
- http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz.

Each file was converted to UTF-8 via iconv. I created the
dictionary with the following command:

CREATE TEXT SEARCH DICTIONARY german_ispell (
    Template = ispell,
    DictFile = de_de_frami,
    AffFile = de_de_frami,
    StopWords = german
);

Then I test it via:

SELECT ts_lexize('german_ispell', 'haustür');

which should result in 'haus' and 'tür'. The first two
dictionaries return nothing at all. Compound words don't seem
to work with those two.

The third one works and returns 'haus' and 'tür', but only if I
remove all lines containing umlauts from de_de_frami.affix. If I
leave those lines in the affix file, I get a syntax error during
parsing:

ERROR:  syntax error
CONTEXT:  line 224 of configuration file
"/usr/local/share/postgresql/tsearch_data/de_de_frami.affix": "   ABE
  > -ABE,äBIN
"

The problem seems to be that PostgreSQL is running on OpenBSD, which
does not support any locale but C. The affix file contains umlauts
and is encoded in UTF-8, as required by PostgreSQL. The parsing
probably fails in the function parse_affentry in spell.c, because of
the function t_isalpha that it calls.

In t_isalpha there is:

if (clen == 1 || lc_ctype_is_c())
    return isalpha(TOUCHAR(ptr));

which fails for the umlauts in the affix file. Is there any reason to
check for an lc_ctype of C here? The affix file is in UTF-8 and each line
is converted to the encoding used by the database, so why is there
a check for the C locale?

Or am I completely wrong, and this is not the reason the parsing of
the affix file fails?

Thanks for your help.

Christof

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

