Artur 2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>: ... > This solution does not take into account the fact that queries in german can > be in various forms (for example, plural forms). It works well for english. > As a solution of this, you can put into geo.ths various forms of word. > > But maybe the solution above is what you need. Otherwise it can serve as a > concept of solution. Many thanks for this helpful input! I actually expect that stemming takes place for english and german. And we will in fact have queries in english and in german as well. So I think we still have some issues to resolve...? :Stefan 2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>: > Hello, Stefan! > > On 15.07.2016 01:54, Stefan Keller wrote: >> >> приве́т! Artur >> >> Thanks for your explanations. >> >> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>: >>> >>> On 14.07.2016 01:16, Stefan Keller wrote: >> >> ... >>>> >>>> * Should I create a synonym dictionary which contains word >>>> translations en-de instead of synonyms en-en? >>> >>> >>> This synonym dictionary will contain a thousands entries. So it will >>> require >>> a great effort to make this dictionary. >> >> >> It's a domain-specific corpus of max. 1000 records of descriptive text >> (metadata) about geographic data, like topographic map, land use >> planning, etc. > > > Oh, I understand. If you need word stemming you can use a thesaurus > dictionary. So you should do the following. > > 1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of > entry: > > wald : forest > > 2 - PostgreSQL already has english_stem dictionary. In this solution you do > not need german_stem dictionary. Create the dictionary and the configuration > in PostgreSQL: > > =# CREATE TEXT SEARCH DICTIONARY geo_ths ( > Template = thesaurus, > DictFile = geo, > Dictionary = pg_catalog.english_stem); > =# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple'); > =# ALTER TEXT SEARCH CONFIGURATION geo_ths > ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, > word, hword, hword_part > WITH geo_ths, english_stem; > > For example, we have the following data: > > =# CREATE TABLE geo (id int, body_en text, body_de text); > =# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald'); > > Then you can execute the following queries: > > =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ > to_tsquery('geo_ths', 'forest'); > id | body_en | body_de > ----+---------+--------- > 1 | forest | (null) > (1 row) > > =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ > to_tsquery('geo_ths', 'forest'); > id | body_en | body_de > ----+---------+--------- > 2 | (null) | wald > (1 row) > > =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ > to_tsquery('geo_ths', 'wald'); > id | body_en | body_de > ----+---------+--------- > 1 | forest | (null) > (1 row) > > =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ > to_tsquery('geo_ths', 'wald'); > id | body_en | body_de > ----+---------+--------- > 2 | (null) | wald > (1 row) > > This solution does not take into account the fact that queries in german can > be in various forms (for example, plural forms). It works well for english. > As a solution of this, you can put into geo.ths various forms of word. > > But maybe the solution above is what you need. Otherwise it can serves as a > concept of solution. > > >> >> ... >>>> >>>> * How to setup a text search configuration which e.g. stems en and de >>>> words? >> >> >> I still would like to give FTS a try with synonym dictionary (en-de). >> Now, I'm wondering how to setup the configuration. I've seen examples >> to process either english, german or russian alone. But I did not find >> yet any documentation on how to setup the text search configuration >> where a corpus contains two (or more) languages at same time in a >> table (body_en and body_de). >> >> :Stefan >> >> >> >> >> >> >> >> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>: >>> >>> Hi, >>> >>> On 14.07.2016 01:16, Stefan Keller wrote: >>>> >>>> >>>> Hi, >>>> >>>> I have a text corpus which contains either German or English docs and >>>> I expect queries where I don't know if it's German or English. So I'd >>>> like e.g. that a query "forest" matches "forest" in body_en but also >>>> "Wald" in body_de. >>>> >>>> I created a table with attributes body_en and body_de (type "text"). I >>>> will use ts_vector/ts_query on the fly (don't need yet an index >>>> (attributes)). >>>> >>>> * Can FTS handle this multilingual situation? >>> >>> >>> >>> In my opinion, PostgreSQL cant handle it. It cant translate words from >>> one >>> language to another, it just stems word from original form to basic form. >>> First you need to translate word from English to German, then search word >>> in >>> the body_de attribute. >>> >>> And the issue is complicated by the fact that one word could have >>> different >>> meaning in the other language. >>> >>>> * How to setup a text search configuration which e.g. stems en and de >>>> words? >>>> * Should I create a synonym dictionary which contains word >>>> translations en-de instead of synonyms en-en? >>> >>> >>> >>> This synonym dictionary will contain a thousands entries. So it will >>> require >>> a great effort to make this dictionary. >>> >>> >>>> * Any hints to related work where FTS has been used in a multilingual >>>> context? >>>> >>>> :Stefan >>>> >>>> >>> >>> -- >>> Artur Zakirov >>> Postgres Professional: http://www.postgrespro.com >>> Russian Postgres Company > > > > -- > Artur Zakirov > Postgres Professional: http://www.postgrespro.com > Russian Postgres Company -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general