Re: [to_tsvector] German Compound Words

"Sven R. Kunze" <srkunze@xxxxxxxxxxxx> · Mon, 01 Jun 2015 10:13:05 +0200

    Alright. I got it running and used
      http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/ ;
      specifically:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
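
For anyone reproducing this, the dictionary from that archive could be registered roughly as follows. This is only a sketch: the file names german.dict and german.affix (referenced as `german`) are assumptions about how the unpacked files were renamed, converted to UTF-8, and copied into $SHAREDIR/tsearch_data. Only the dictionary name german_ispell is taken from the output in this mail.

```sql
-- Assumption: german.dict and german.affix are in $SHAREDIR/tsearch_data,
-- stored as UTF-8. These file names are illustrative, not from the archive.
CREATE TEXT SEARCH DICTIONARY german_ispell (
    TEMPLATE  = ispell,
    DictFile  = german,
    AffFile   = german,
    StopWords = german
);

-- Probe the dictionary directly, bypassing parser and configuration.
-- Returns an array of lexemes, or NULL if the word is unknown.
SELECT ts_lexize('german_ispell', 'konstruktion');
```

Probing with ts_lexize first makes it easier to tell dictionary problems apart from configuration problems.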

      I am not sure where to find up-to-date, authoritative ispell
      dictionaries. I figure that I need to change this particular
      dictionary in order to avoid "ion" being split away from words
      like "produktION/konstruktION" etc.:

      =# select * from ts_debug('public.german_compound_ispell', 'konstruktion');
         alias   |   description   |    token     |        dictionaries         |  dictionary   |           lexemes
      -----------+-----------------+--------------+-----------------------------+---------------+------------------------------
       asciiword | Word, all ASCII | konstruktion | {german_ispell,german_stem} | german_ispell | {konstruktion,konstrukt,ion}

      The splitting of compound words is unfortunately not consistent
      (wasserkraft vs konstruktionsplan):

      =# select * from ts_debug('public.german_compound_ispell', 'wasserkraft');
         alias   |   description   |    token    |        dictionaries         |  dictionary   |          lexemes
      -----------+-----------------+-------------+-----------------------------+---------------+----------------------------
       asciiword | Word, all ASCII | wasserkraft | {german_ispell,german_stem} | german_ispell | {wasserkraft,wasser,kraft}

      =# select * from ts_debug('public.german_compound_ispell', 'konstruktionsplan');
         alias   |   description   |       token       |        dictionaries         |  dictionary   |       lexemes
      -----------+-----------------+-------------------+-----------------------------+---------------+---------------------
       asciiword | Word, all ASCII | konstruktionsplan | {german_ispell,german_stem} | german_ispell | {konstruktion,plan}

      I am not sure where the 'sch' comes from:

      =# select * from ts_debug('public.german_compound_ispell', 'rundflansch');
         alias   |   description   |    token    |        dictionaries         |  dictionary   |           lexemes
      -----------+-----------------+-------------+-----------------------------+---------------+------------------------------
       asciiword | Word, all ASCII | rundflansch | {german_ispell,german_stem} | german_ispell | {rund,flansch,rund,flan,sch}

      This is another funny example:

      =# select * from ts_debug('public.german_compound_ispell', 'datenbanken');
         alias   |   description   |    token    |        dictionaries         |  dictionary   |                                     lexemes
      -----------+-----------------+-------------+-----------------------------+---------------+---------------------------------------------------------------------------------
       asciiword | Word, all ASCII | datenbanken | {german_ispell,german_stem} | german_ispell | {datenbank,daten,date,banken,daten,date,bank,daten,date,banken,daten,date,bank}
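
The repeated entries in that lexeme array are easier to judge once they are collapsed. A minimal sketch, assuming the same public.german_compound_ispell configuration as above: the first query lists each distinct lexeme once, the second shows what to_tsvector actually stores, where every lexeme appears only once.

```sql
-- List each lexeme the dictionary chain produced, without duplicates.
SELECT DISTINCT lexeme
FROM ts_debug('public.german_compound_ispell', 'datenbanken') AS d,
     unnest(d.lexemes) AS lexeme
ORDER BY lexeme;

-- The tsvector keeps one entry per distinct lexeme, so the duplicates
-- reported by ts_debug do not multiply entries in the indexed value.
SELECT to_tsvector('public.german_compound_ispell', 'datenbanken');
```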

      On 01.06.2015 09:25, Sven R. Kunze wrote:

      I actually wanted to minimize the
        installation effort. Thus, I used the hunspell-de-de package of
        Debian/Ubuntu.

        Give me a second for ispell.

        Below, see the hunspell variant for
        Produktionsintervall/Produktionintervall:

        =# select * from ts_debug('public.german_compound', 'Produktionsintervall');
           alias   |   description   |        token         |         dictionaries          | dictionary  |        lexemes
        -----------+-----------------+----------------------+-------------------------------+-------------+------------------------
         asciiword | Word, all ASCII | Produktionsintervall | {german_hunspell,german_stem} | german_stem | {produktionsintervall}
        (1 row)

        =# select * from ts_debug('public.german_compound', 'Produktionintervall');
           alias   |   description   |        token        |         dictionaries          | dictionary  |        lexemes
        -----------+-----------------+---------------------+-------------------------------+-------------+-----------------------
         asciiword | Word, all ASCII | Produktionintervall | {german_hunspell,german_stem} | german_stem | {produktionintervall}

        PS: I am posting your answer to the list as well.

        On 28.05.2015 19:42, Oleg Bartunov wrote:

          For readability it's better to use 

            select * from ts_debug

          I remember there is a problem with correct support of hunspell
          files. Did you try ispell files?

            Also, I found this message http://www.postgresql.org/message-id/dm1ece$2gb5$1@xxxxxxxxxxxx

Try this word - Produktionintervall
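
To keep ts_debug output narrow enough for a mail, one option is to select only the columns of interest instead of `*`. A small sketch, assuming the public.german_compound configuration discussed in this thread:

```sql
-- Show only which dictionary recognized the token and what it returned.
SELECT token, dictionary, lexemes
FROM ts_debug('public.german_compound', 'Produktionintervall');
```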

          On Thu, May 28, 2015 at 6:34 PM, Sven R. Kunze <srkunze@xxxxxxxxxxxx> wrote:

                Sure. Here you are:

                  =# select ts_debug('public.german_compound', 'wasserkraft');
                                                                ts_debug
                  -----------------------------------------------------------------------------------------------------
                   (asciiword,"Word, all ASCII",wasserkraft,"{german_hunspell,german_stem}",german_stem,{wasserkraft})

                  =# select ts_debug('public.german_compound', 'schifffahrt');
                                                                  ts_debug
                  ---------------------------------------------------------------------------------------------------------
                   (asciiword,"Word, all ASCII",schifffahrt,"{german_hunspell,german_stem}",german_hunspell,{schifffahrt})

                  =# select ts_debug('public.german_compound', 'blindflansch');
                                                                 ts_debug
                  -------------------------------------------------------------------------------------------------------
                   (asciiword,"Word, all ASCII",blindflansch,"{german_hunspell,german_stem}",german_stem,{blindflansch})

                  That is my testing configuration:

                  =# \dF+ german_compound
                  Text search configuration "public.german_compound"
                  Parser: "pg_catalog.default"
                        Token      |        Dictionaries
                  -----------------+-----------------------------
                   asciihword      | german_hunspell,german_stem
                   asciiword       | german_hunspell,german_stem
                   email           | simple
                   file            | simple
                   float           | simple
                   host            | simple
                   hword           | german_hunspell,german_stem
                   hword_asciipart | german_hunspell,german_stem
                   hword_numpart   | simple
                   hword_part      | german_hunspell,german_stem
                   int             | simple
                   numhword        | simple
                   numword         | simple
                   sfloat          | simple
                   uint            | simple
                   url             | simple
                   url_path        | simple
                   version         | simple
                   word            | german_hunspell,german_stem

                      On 28.05.2015 17:24, Oleg Bartunov wrote:

                      ts_debug() ?

                        =# select * from ts_debug('english', 'messages');
                           alias   |   description   |  token   |  dictionaries  |  dictionary  | lexemes
                        -----------+-----------------+----------+----------------+--------------+----------
                         asciiword | Word, all ASCII | messages | {english_stem} | english_stem | {messag}

                        On Thu, May 28, 2015 at 2:05 PM, Sven R. Kunze <srkunze@xxxxxxxxxxxx> wrote:

                          Hi everybody,

                            what do I need to do in order to enable
                            compound word handling in PostgreSQL's
                            tsvector implementation?

                            I run an Ubuntu 14.04 machine with PostgreSQL
                            9.3, have installed the package hunspell-de-de,
                            and have already created a new dictionary as
                            described here: http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY

                            CREATE TEXT SEARCH DICTIONARY german_hunspell (
                                TEMPLATE = ispell,
                                DictFile = de_de,
                                AffFile = de_de,
                                StopWords = german
                            );

                            Furthermore, I created a new test text search
                            configuration (copied from german) and
                            updated every token mapping that used the
                            german_stem dictionary so that it tries
                            german_hunspell first and then
                            german_stem.
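
The configuration step described above can be sketched as follows. The names german_compound and german_hunspell and the list of token types come from the \dF+ output elsewhere in this thread; the exact statements are a reconstruction, not the commands originally run.

```sql
-- Start from the built-in German configuration.
CREATE TEXT SEARCH CONFIGURATION public.german_compound (
    COPY = pg_catalog.german
);

-- For every word-like token type, consult german_hunspell first and
-- fall back to the Snowball stemmer if the word is not recognized.
ALTER TEXT SEARCH CONFIGURATION public.german_compound
    ALTER MAPPING FOR asciiword, asciihword, hword, hword_asciipart,
                      hword_part, word
    WITH german_hunspell, german_stem;
```

Listing both dictionaries explicitly keeps the order visible; the first dictionary that recognizes a token wins, so german_stem only sees words german_hunspell does not know.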

                            However, to_tsvector still does not split
                            compound words such as:

                            wasserkraft -> wasserkraft, kraft
                            schifffahrt -> schifffahrt, fahrt
                            blindflansch -> blindflansch, flansch
                            etc.

                            What have I done wrong here?
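
One way to state the expected behaviour as a check, sketched under the assumption that the configuration is named public.german_compound as elsewhere in this thread: each query should return true once the compound is split, because the component then exists as its own lexeme in the tsvector.

```sql
-- Does a search for the second component find the compound word?
SELECT to_tsvector('public.german_compound', 'wasserkraft')
       @@ to_tsquery('public.german_compound', 'kraft')   AS kraft_found;

SELECT to_tsvector('public.german_compound', 'schifffahrt')
       @@ to_tsquery('public.german_compound', 'fahrt')   AS fahrt_found;

SELECT to_tsvector('public.german_compound', 'blindflansch')
       @@ to_tsquery('public.german_compound', 'flansch') AS flansch_found;
```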

                                -- 

                                Sven R. Kunze

                                TBZ-PARIV GmbH, Bernsdorfer Str.
                                210-212, 09126 Chemnitz

                                Tel: +49 (0)371 33714721, Fax: +49
                                (0)371 5347920

                                e-mail: srkunze@xxxxxxxxxxxx

                                web: www.tbz-pariv.de

                                Geschäftsführer: Dr. Reiner Wohlgemuth

                                Sitz der Gesellschaft: Chemnitz

                                Registergericht: Chemnitz HRB 8543

                                -- 

                                Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

                                To make changes to your subscription:

                                http://www.postgresql.org/mailpref/pgsql-general
