
Re: [to_tsvector] German Compound Words

Alright, I got it running. I used the dictionaries from http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ and specifically this archive: http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz

I am not sure where to find up-to-date, authoritative ispell dictionaries. I figure I need to change this particular dictionary to keep "ion" from being split off words such as "produktION" and "konstruktION":

=# select * from ts_debug('public.german_compound_ispell', 'konstruktion');
   alias   |   description   |    token     |        dictionaries         |  dictionary   |           lexemes           
-----------+-----------------+--------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | konstruktion | {german_ispell,german_stem} | german_ispell | {konstruktion,konstrukt,ion}
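
To see what the ispell dictionary returns on its own, without the parser and the rest of the configuration, ts_lexize can be called on it directly. A minimal sketch, assuming the dictionary is registered under the name german_ispell as it appears in the ts_debug output:

```sql
-- Ask the dictionary directly, bypassing the parser and the
-- dictionary chain of the text search configuration.
-- Assumption: a dictionary named german_ispell exists, as the
-- ts_debug output suggests.
SELECT ts_lexize('german_ispell', 'konstruktion');
SELECT ts_lexize('german_ispell', 'konstruktionsplan');
```

ts_lexize returns NULL when the dictionary does not recognize the token and an empty array for a stop word, so it also shows which words fall through to german_stem.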


Unfortunately, compound splitting is not consistent: for wasserkraft the full compound is kept alongside its parts, whereas for konstruktionsplan only the parts are returned:

=# select * from ts_debug('public.german_compound_ispell', 'wasserkraft');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |          lexemes          
-----------+-----------------+-------------+-----------------------------+---------------+----------------------------
 asciiword | Word, all ASCII | wasserkraft | {german_ispell,german_stem} | german_ispell | {wasserkraft,wasser,kraft}

=# select * from ts_debug('public.german_compound_ispell', 'konstruktionsplan');
   alias   |   description   |       token       |        dictionaries         |  dictionary   |       lexemes      
-----------+-----------------+-------------------+-----------------------------+---------------+---------------------
 asciiword | Word, all ASCII | konstruktionsplan | {german_ispell,german_stem} | german_ispell | {konstruktion,plan}


I am not sure where the 'sch' comes from:

=# select * from ts_debug('public.german_compound_ispell', 'rundflansch');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |           lexemes           
-----------+-----------------+-------------+-----------------------------+---------------+------------------------------
 asciiword | Word, all ASCII | rundflansch | {german_ispell,german_stem} | german_ispell | {rund,flansch,rund,flan,sch}


This is another funny example:

=# select * from ts_debug('public.german_compound_ispell', 'datenbanken');
   alias   |   description   |    token    |        dictionaries         |  dictionary   |                                     lexemes                                    
-----------+-----------------+-------------+-----------------------------+---------------+---------------------------------------------------------------------------------
 asciiword | Word, all ASCII | datenbanken | {german_ispell,german_stem} | german_ispell | {datenbank,daten,date,banken,daten,date,bank,daten,date,banken,daten,date,bank}
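
The repeated lexemes are less of a problem than they look, because a tsvector stores each distinct lexeme only once. A quick check, sketched under the assumption that the same public.german_compound_ispell configuration is in place:

```sql
-- to_tsvector keeps one entry per distinct lexeme, so the repeated
-- daten/date/bank/banken entries reported by ts_debug are merged.
SELECT to_tsvector('public.german_compound_ispell', 'datenbanken');
```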



On 01.06.2015 09:25, Sven R. Kunze wrote:
I actually wanted to minimize the installation effort. Thus, I used the hunspell-de-de package of Debian/Ubuntu.

Give me a second for ispell.

Below, see the hunspell variant for Produktionsintervall/Produktionintervall:

=# select * from ts_debug('public.german_compound', 'Produktionsintervall');
   alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes        
-----------+-----------------+----------------------+-------------------------------+-------------+------------------------
 asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
(1 row)

=# select * from ts_debug('public.german_compound', 'Produktionintervall');
   alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes       
-----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
 asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}



PS: I am posting your answer to the list as well.

On 28.05.2015 19:42, Oleg Bartunov wrote:
For readability it's better to use

select * from ts_debug

I remember there is a problem with correct support of hunspell files. Did you try ispell files?
Also, I found this message http://www.postgresql.org/message-id/dm1ece$2gb5$1@xxxxxxxxxxxx

Try this word - Produktionintervall


On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <srkunze@xxxxxxxxxxxx> wrote:
Sure. Here you are:

=# select ts_debug('public.german_compound', 'wasserkraft');
                                              ts_debug                                              
-----------------------------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})

=# select ts_debug('public.german_compound', 'schifffahrt');
                                                ts_debug                                                
---------------------------------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})

=# select ts_debug('public.german_compound', 'blindflansch');
                                               ts_debug                                               
-------------------------------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})

This is my test configuration:

=# \dF+ german_compound
Text search configuration "public.german_compound"
Parser: "pg_catalog.default"
      Token      |        Dictionaries        
-----------------+-----------------------------
 asciihword      | german_hunspell,german_stem
 asciiword       | german_hunspell,german_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | german_hunspell,german_stem
 hword_asciipart | german_hunspell,german_stem
 hword_numpart   | simple
 hword_part      | german_hunspell,german_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | german_hunspell,german_stem


On 28.05.2015 17:24, Oleg Bartunov wrote:
ts_debug() ?

=# select * from ts_debug('english', 'messages');
   alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+----------+----------------+--------------+----------
 asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}


On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze <srkunze@xxxxxxxxxxxx> wrote:
Hi everybody,

What do I need to do to enable compound word handling in PostgreSQL's tsvector implementation?

I run Ubuntu 14.04 with PostgreSQL 9.3, have installed the hunspell-de-de package, and have already created a new dictionary as described here: http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

CREATE TEXT SEARCH DICTIONARY german_hunspell (
    TEMPLATE = ispell,
    DictFile = de_de,
    AffFile = de_de,
    StopWords = german
);

Furthermore, I created a new test text search configuration (copied from german) and updated every token mapping that used the german_stem dictionary so that it now uses german_hunspell first and then german_stem.
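
A configuration like the one described could be created roughly as follows. This is a sketch, not the exact statements used in this thread: the names public.german_compound and german_hunspell and the list of token types are taken from the \dF+ output shown earlier in the thread, and copying from pg_catalog.german is an assumption.

```sql
-- Copy the built-in German configuration under a new name.
-- Assumption: the source is the stock pg_catalog.german configuration.
CREATE TEXT SEARCH CONFIGURATION public.german_compound (
    COPY = pg_catalog.german
);

-- Route every word-like token type through german_hunspell first;
-- tokens it does not recognize fall through to german_stem.
ALTER TEXT SEARCH CONFIGURATION public.german_compound
    ALTER MAPPING FOR asciihword, asciiword, hword,
                      hword_asciipart, hword_part, word
    WITH german_hunspell, german_stem;
```

The order of the dictionaries matters: the first dictionary that recognizes a token decides its lexemes, so the stemmer only sees what the ispell-template dictionary leaves over.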

However, to_tsvector still does not split compound words. I expected results such as:

wasserkraft -> wasserkraft, kraft
schifffahrt -> schifffahrt, fahrt
blindflansch -> blindflansch, flansch

etc.
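
The same expectation can be checked from the search side: a query for the second part of a compound should match the full word. A sketch, assuming the public.german_compound configuration from this thread:

```sql
-- Should return true once compound splitting works, i.e. once
-- to_tsvector emits 'kraft' as a separate lexeme for 'wasserkraft'.
SELECT to_tsvector('public.german_compound', 'wasserkraft')
       @@ to_tsquery('public.german_compound', 'kraft') AS matches;
```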


What have I done wrong here?

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@xxxxxxxxxxxx
web: www.tbz-pariv.de

Geschäftsführer: Dr. Reiner Wohlgemuth
Sitz der Gesellschaft: Chemnitz
Registergericht: Chemnitz HRB 8543






