Search Postgresql Archives

Re: FTS with more than one language in body and with unknown query language?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello, Stefan!

On 15.07.2016 01:54, Stefan Keller wrote:
приве́т! Artur

Thanks for your explanations.

2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
On 14.07.2016 01:16, Stefan Keller wrote:
...
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?

This synonym dictionary will contain a thousands entries. So it will require
a great effort to make this dictionary.

It's a domain-specific corpus of max. 1000 records of descriptive text
(metadata) about geographic data, like topographic map, land use
planning, etc.

Oh, I understand. If you need word stemming you can use a thesaurus dictionary. So you should do the following.

1 - Create the file $SHAREDIR/tsearch_data/geo.ths. Here is my example of entry:

wald : forest

2 - PostgreSQL already has english_stem dictionary. In this solution you do not need german_stem dictionary. Create the dictionary and the configuration in PostgreSQL:

=# CREATE TEXT SEARCH DICTIONARY geo_ths (
    Template = thesaurus,
    DictFile = geo,
    Dictionary = pg_catalog.english_stem);
=# CREATE TEXT SEARCH CONFIGURATION geo_ths(COPY='simple');
=# ALTER TEXT SEARCH CONFIGURATION geo_ths
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part
    WITH geo_ths, english_stem;

For example, we have the following data:

=# CREATE TABLE geo (id int, body_en text, body_de text);
=# INSERT INTO geo VALUES (1, 'forest', NULL), (2, NULL, 'wald');

Then you can execute the following queries:

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ to_tsquery('geo_ths', 'forest');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ to_tsquery('geo_ths', 'forest');
 id | body_en | body_de
----+---------+---------
  2 | (null)  | wald
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_en) @@ to_tsquery('geo_ths', 'wald');
 id | body_en | body_de
----+---------+---------
  1 | forest  | (null)
(1 row)

=# SELECT * FROM geo WHERE to_tsvector('geo_ths', body_de) @@ to_tsquery('geo_ths', 'wald');
 id | body_en | body_de
----+---------+---------
  2 | (null)  | wald
(1 row)

This solution does not take into account the fact that queries in german can be in various forms (for example, plural forms). It works well for english. As a solution of this, you can put into geo.ths various forms of word.

But maybe the solution above is what you need. Otherwise it can serves as a concept of solution.


...
* How to setup a text search configuration which e.g. stems en and de
words?

I still would like to give FTS a try with synonym dictionary (en-de).
Now, I'm wondering how to setup the configuration. I've seen examples
to process either english, german or russian alone. But I did not find
yet any documentation on how to setup the text search configuration
where a corpus contains two (or more) languages at same time in a
table (body_en and body_de).

:Stefan







2016-07-14 17:20 GMT+02:00 Artur Zakirov <a.zakirov@xxxxxxxxxxxxxx>:
Hi,

On 14.07.2016 01:16, Stefan Keller wrote:

Hi,

I have a text corpus which contains either German or English docs and
I expect queries where I don't know if it's German or English. So I'd
like e.g. that a query "forest" matches "forest" in body_en but also
"Wald" in body_de.

I created a table with attributes body_en and body_de (type "text"). I
will use ts_vector/ts_query on the fly (don't need yet an index
(attributes)).

* Can FTS handle this multilingual situation?


In my opinion, PostgreSQL cant handle it. It cant translate words from one
language to another, it just stems word from original form to basic form.
First you need to translate word from English to German, then search word in
the body_de attribute.

And the issue is complicated by the fact that one word could have different
meaning in the other language.

* How to setup a text search configuration which e.g. stems en and de
words?
* Should I create a synonym dictionary which contains word
translations en-de instead of synonyms en-en?


This synonym dictionary will contain a thousands entries. So it will require
a great effort to make this dictionary.


* Any hints to related work where FTS has been used in a multilingual
context?

:Stefan



--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux