Re: FTS with more than one language in body and with unknown query language?

Stefan Keller <sfkeller@xxxxxxxxx> · Fri, 15 Jul 2016 20:34:16 +0200

Artur

2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
...
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serve as a
> concept of solution.

Many thanks for this helpful input!
I actually expect that stemming takes place for english and german.
And we will in fact have queries in english and in german as well.
So I think we still have some issues to resolve...?

:Stefan

2016-07-15 11:02 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
> Hello, Stefan!
>
> On 15.07.2016 01:54, Stefan Keller wrote:
>>
>> приве́т! Artur
>>
>> Thanks for your explanations.
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:
>>
>> ...
>>>>
>>>> * Should I create a synonym dictionary which contains word
>>>> translations en-de instead of synonyms en-en?
>>>
>>>
>>> This synonym dictionary will contain a thousands entries. So it will
>>> require
>>> a great effort to make this dictionary.
>>
>>
>> It's a domain-specific corpus of max. 1000 records of descriptive text
>> (metadata) about geographic data, like topographic map, land use
>> planning, etc.
>
>
> Oh, I understand. If you need word stemming you can use a thesaurus
> dictionary. So you should do the following.
>
> 1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of
> entry:
>
> wald : forest
>
> 2 - PostgreSQL already has english_stem dictionary. In this solution you do
> not need german_stem dictionary. Create the dictionary and the configuration
> in PostgreSQL:
>
> =# CREATE TEXT SEARCH DICTIONARY geo_ths (
>     Template = thesaurus,
>     DictFile = geo,
>     Dictionary = pg_catalog.english_stem);
> =# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
> =# ALTER TEXT SEARCH CONFIGURATION geo_ths
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part
>     WITH geo_ths, english_stem;
>
> For example, we have the following data:
>
> =# CREATE TABLE geo (id int, body_en text, body_de text);
> =# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');
>
> Then you can execute the following queries:
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> ----+---------+---------
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'forest');
>  id | body_en | body_de
> ----+---------+---------
>   2 | (null)  | wald
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> ----+---------+---------
>   1 | forest  | (null)
> (1 row)
>
> =# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@
> to_tsquery('geo_ths', 'wald');
>  id | body_en | body_de
> ----+---------+---------
>   2 | (null)  | wald
> (1 row)
>
> This solution does not take into account the fact that queries in german can
> be in various forms (for example, plural forms). It works well for english.
> As a solution of this, you can put into geo.ths various forms of word.
>
> But maybe the solution above is what you need. Otherwise it can serves as a
> concept of solution.
>
>
>>
>> ...
>>>>
>>>> * How to setup a text search configuration which e.g. stems en and de
>>>> words?
>>
>>
>> I still would like to give FTS a try with synonym dictionary (en-de).
>> Now, I'm wondering how to setup the configuration. I've seen examples
>> to process either english, german or russian alone. But I did not find
>> yet any documentation on how to setup the text search configuration
>> where a corpus contains two (or more) languages at same time in a
>> table (body_en and body_de).
>>
>> :Stefan
>>
>>
>>
>>
>>
>>
>>
>> 2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
>>>
>>> Hi,
>>>
>>> On 14.07.2016 01:16, Stefan Keller wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I have a text corpus which contains either German or English docs and
>>>> I expect queries where I don't know if it's German or English. So I'd
>>>> like e.g. that a query "forest" matches "forest" in body_en but also
>>>> "Wald" in body_de.
>>>>
>>>> I created a table with attributes body_en and body_de (type "text"). I
>>>> will use ts_vector/ts_query on the fly (don't need yet an index
>>>> (attributes)).
>>>>
>>>> * Can FTS handle this multilingual situation?
>>>
>>>
>>>
>>> In my opinion, PostgreSQL cant handle it. It cant translate words from
>>> one
>>> language to another, it just stems word from original form to basic form.
>>> First you need to translate word from English to German, then search word
>>> in
>>> the body_de attribute.
>>>
>>> And the issue is complicated by the fact that one word could have
>>> different
>>> meaning in the other language.
>>>
>>>> * How to setup a text search configuration which e.g. stems en and de
>>>> words?
>>>> * Should I create a synonym dictionary which contains word
>>>> translations en-de instead of synonyms en-en?
>>>
>>>
>>>
>>> This synonym dictionary will contain a thousands entries. So it will
>>> require
>>> a great effort to make this dictionary.
>>>
>>>
>>>> * Any hints to related work where FTS has been used in a multilingual
>>>> context?
>>>>
>>>> :Stefan
>>>>
>>>>
>>>
>>> --
>>> Artur Zakirov
>>> Postgres Professional: http://www.postgrespro.com
>>> Russian Postgres Company
>
>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general