Re: [tsvector] to_tsvector called multiple times

"Sven R. Kunze" <srkunze@xxxxxxxxxxxx> · Tue, 26 May 2015 11:47:43 +0200

Thanks Albe for that detailed answer.

On 26.05.2015 11:01, Albe Laurenz wrote:
Sven R. Kunze wrote:
the following stemming results made me curious:

select to_tsvector('german', 'systeme'); > 'system':1
select to_tsvector('german', 'systemes'); > 'system':1
select to_tsvector('german', 'systems'); > 'system':1
select to_tsvector('german', 'systemen'); > 'system':1
select to_tsvector('german', 'system'); > 'syst':1

First of all, this seems to be a bug in the German stemmer. Where can I
fix it?

As far as I understand, the stemmer is not perfect, it is just a "best
effort" at German stemming.  It does not have a dictionary of valid German
words, but uses an algorithm based on only the occurring letters.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

Thanks for that hint. I will go to 
https://github.com/snowballstem/snowball/issues and try to explain my 
problem there.

However, are you sure that I am using Snowball? Maybe I am reading the
documentation wrong:
http://www.postgresql.org/docs/9.3/static/textsearch-dictionaries.html
but it seems to depend on which packages (ispell, hunspell, myspell,
snowball + corresponding languages) my system has installed.

Is there an easy way to determine which of these packages PostgreSQL 
uses AND what for?
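
One way to check this, as a rough sketch: the text search configuration and
its dictionaries can be inspected from SQL. The queries below assume the stock
'german' configuration; ts_debug and the pg_ts_dict / pg_ts_template catalogs
are standard PostgreSQL, while the sample word and the selected columns are
only illustrative choices.

```sql
-- Show which dictionary handles each token under the 'german' configuration.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('german', 'system');

-- List every installed dictionary together with the template it is built on
-- (snowball, ispell, simple, ...).
SELECT d.dictname, t.tmplname
FROM pg_ts_dict AS d
JOIN pg_ts_template AS t ON t.oid = d.dicttemplate
ORDER BY d.dictname;
```

In psql, \dF+ german prints the token-to-dictionary mapping of that
configuration in a single table.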

In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German.  The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem".

Second, and more importantly, as I understand it, the stemmed version of
a word should be considered normalized. That is, all other versions of
that stem should be mapped to it as well. The interesting problem here
is that PostgreSQL maps the stem itself ('system') to a completely
different stem ('syst').

Should a stem not remain stable even when to_tsvector is called on it
multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word.  If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test=> select to_tsvector('german', 'linsen');
  to_tsvector
-------------
  'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
  to_tsvector
-------------
  'lin':1
(1 row)

Sure. That might be the problem. It occurs to me that stems (if detected
as such) should be left alone.
If a stem is a real German word, it should be stemmed to itself anyway.
If not, it might help not to stem it, in order to avoid errors.
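
To see where stemming is not a fixed point, one could stem each word twice and
compare the results. This is only a sketch: it assumes the Snowball dictionary
carries its default name 'german_stem', and the three sample words are simply
the ones mentioned in this thread.

```sql
-- Assumes the Snowball dictionary is named 'german_stem' (the default name).
-- Wherever stem_of_stem differs from stem, stemming is not idempotent.
SELECT word,
       (ts_lexize('german_stem', word))[1] AS stem,
       (ts_lexize('german_stem',
                  (ts_lexize('german_stem', word))[1]))[1] AS stem_of_stem
FROM (VALUES ('systeme'), ('system'), ('linsen')) AS t(word);
```

ts_lexize calls one dictionary directly, so this bypasses the parser and any
other dictionaries that the 'german' configuration might list.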

I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.

Yours,
Laurenz Albe

This might pose a problem in the future, of course. Thanks for pointing
this out as well.
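
For reference, the search mismatch can be reproduced directly. This is a
sketch based on the stems quoted above; the prefix query in the second
statement is my own assumption about a possible workaround, not something
proposed in this thread, and it will also match unrelated words that happen to
start with the same letters.

```sql
-- Given the stems quoted above, the document lexeme is 'system' while the
-- query lexeme is 'syst', so a plain match is expected to fail.
SELECT to_tsvector('german', 'systeme') @@ to_tsquery('german', 'system')
       AS plain_match;

-- Possible workaround (an assumption): a prefix query, which matches any
-- lexeme that starts with the stemmed query term.
SELECT to_tsvector('german', 'systeme') @@ to_tsquery('german', 'system:*')
       AS prefix_match;
```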

Regards,

--
Sven R. Kunze
TBZ-PARIV GmbH, Bernsdorfer Str. 210-212, 09126 Chemnitz
Tel: +49 (0)371 33714721, Fax: +49 (0)371 5347920
e-mail: srkunze@xxxxxxxxxxxx
web: www.tbz-pariv.de

Managing Director: Dr. Reiner Wohlgemuth
Registered office: Chemnitz
Commercial register: Chemnitz HRB 8543

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)