
Re: [tsvector] to_tsvector called multiple times

Sven R. Kunze wrote:
> the following stemming results made me curious:
> 
> select to_tsvector('german', 'systeme');
> 'system':1
> select to_tsvector('german', 'systemes');
> 'system':1
> select to_tsvector('german', 'systems');
> 'system':1
> select to_tsvector('german', 'systemen');
> 'system':1
> select to_tsvector('german', 'system');
> 'syst':1
> 
> 
> First of all, this seems to be a bug in the German stemmer. Where can I
> fix it?

As far as I understand, the stemmer is not perfect; it is just a "best
effort" at German stemming.  It does not have a dictionary of valid German
words, but uses an algorithm based only on the letters that occur in the word.

This web page describes the algorithm:
http://snowball.tartarus.org/algorithms/german/stemmer.html
I guess that the Snowball folks (and PostgreSQL) would be interested
if you could come up with a better algorithm.

In this specific case, the stemmer goes wrong because "System" is a
foreign word whose ending is atypical for German.  The algorithm cannot
distinguish between "System" and, say, "lautem" or "bestem".
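
To look at the stemmer in isolation, one can call the Snowball dictionary
directly with ts_lexize.  This is only a sketch: it assumes the stock
german_stem dictionary that ships with PostgreSQL, and the choice of sample
words is mine.

```sql
-- Ask the Snowball dictionary itself, bypassing the text search parser.
-- 'german_stem' is assumed to be the stock dictionary; adjust the name
-- if your text search configuration uses a different one.
select word,
       ts_lexize('german_stem', word) as lexemes
from (values ('system'), ('systeme'), ('lautem'), ('bestem')) as t(word);
```

Comparing the rows should show whether the "-em" ending is stripped the same
way for the foreign word as for the native German forms.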

> Second, and more importantly, as I understand it, the stemmed version of
> a word should be considered normalized. That is, all other versions of
> that stem should be mapped to it as well. The interesting problem here
> is that PostgreSQL maps the stem itself ('system') to a completely
> different stem ('syst').
> 
> Should a stem not remain stable even when to_tsvector is called on it
> multiple times?

That's a possible position, but consider that a stem is not necessarily
a valid German word.  If you treat it as a German word (by stemming it),
the results might not be what you desire.

For example:

test=> select to_tsvector('german', 'linsen');
 to_tsvector
-------------
 'lins':1
(1 row)

test=> select to_tsvector('german', 'lins');
 to_tsvector
-------------
 'lin':1
(1 row)
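
The same effect can be checked for several words at once by feeding the
first lexeme back into the stemmer.  Again a sketch that assumes the stock
german_stem dictionary; the column names are made up for illustration.

```sql
-- Stem each word once, then stem the resulting lexeme a second time.
-- If stems were stable, "once" and "twice" would hold the same lexeme.
select word,
       ts_lexize('german_stem', word) as once,
       ts_lexize('german_stem',
                 (ts_lexize('german_stem', word))[1]) as twice
from (values ('linsen'), ('systeme')) as t(word);
```

Going by the outputs shown above, the two columns would be expected to
differ for both words.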

I guess that your real problem here is that a search for "system"
will not find "systeme", which is indeed unfortunate.
But until somebody can come up with a better stemming algorithm, cases
like that can always occur.
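
To see the mismatch in an actual search, the document and the query can be
compared directly.  The second column tries prefix matching (":*") as one
possible workaround; that is my assumption of something that might help, not
a general fix, and it will also match unrelated words that share the prefix.

```sql
-- Per the results quoted above, 'systeme' is indexed as 'system',
-- while the query word 'system' is reduced to 'syst'.
select to_tsvector('german', 'systeme')
           @@ to_tsquery('german', 'system')   as plain_match,
       -- ':*' turns the (stemmed) query term into a prefix search
       to_tsvector('german', 'systeme')
           @@ to_tsquery('german', 'system:*') as prefix_match;
```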

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




