does ispell have allaffixes set to on?


I was testing the ispell text search dictionary, and it appears to behave as if the ispell option "allaffixes" were set to "on". This wasn't the case for the original tsearch2 contrib module, nor is it the case for the ispell program itself, where the option defaults to "off".

For example, suppose I create a simple DictFile with an entry for the word "brand" (brand/DGRS) and a simple English AffFile that defines the standard ispell suffixes (*D > ED, *G > ING, *R > ER and *S > S) along with the standard ispell prefixes (*A: . > RE, *I: . > IN, *U: . > UN). The ispell dictionary will then return a lexeme for any input token that combines a suffix with one of those prefixes, even though NONE of the prefixes have been listed in the dictionary file as active for that word.
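
For anyone wanting to reproduce this, the two files might look roughly like the sketch below. The exact contents are an assumption: only the flag letters and affix strings come from the description above, the "." (match anything) conditions on the suffix rules are a simplification, and the file names test.dict and test.affix (placed in $SHAREDIR/tsearch_data) follow from DictFile = test and AffFile = test.

```text
# test.dict (assumed contents)
brand/DGRS

# test.affix (assumed, simplified contents)
prefixes

flag *A:
    .   >   RE
flag *I:
    .   >   IN
flag *U:
    .   >   UN

suffixes

flag *D:
    .   >   ED
flag *G:
    .   >   ING
flag *R:
    .   >   ER
flag *S:
    .   >   S
```

Note that the "brand" entry carries only the suffix flags D, G, R and S, and none of the prefix flags A, I or U.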

The following is observed and expected:

mydb=> CREATE TEXT SEARCH DICTIONARY test_ispell (
    TEMPLATE  = ispell,
    DictFile  = test,
    AffFile   = test,
    StopWords = english );

mydb=> SELECT
    ts_lexize('test_ispell', 'branding')  AS sfx_yes,
    ts_lexize('test_ispell', 'brandest')  AS sfx_no,
    ts_lexize('test_ispell', 'notindict') AS dict_no,
    ts_lexize('test_ispell', 'rebrand')   AS pfx_no;
 sfx_yes | sfx_no | dict_no | pfx_no
---------+--------+---------+--------
 {brand} |        |         |
(1 row)


However, the following results are NOT expected:

mydb=> SELECT
    ts_lexize('test_ispell', 'unbranded')  AS sfx_wpfx1,
    ts_lexize('test_ispell', 'rebranding') AS sfx_wpfx2;
 sfx_wpfx1 | sfx_wpfx2
-----------+-----------
 {brand}   | {brand}
(1 row)

In that second statement I expect NULL values indicating that the tokens are unknown, rather than lexemes indicating a match. Is this expected behavior or a bug, and is there any way to control this? Before I try to patch this in the code I'd like to know if it's intentional behavior or not.

It gets even screwier if you add "rebrand" to the dictionary (e.g. rebrand/DGS).
Then ts_lexize('test_ispell', 'rebranding') returns an array of both lexemes "{rebrand,brand}", when only the first is anticipated and wanted.
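
A rough way to reproduce that last case, assuming rebrand/DGS has been appended to the dictionary file. The ALTER statement is there only to make the session reload the files, and re-setting StopWords to the value it already has is just one way of triggering that:

```sql
-- Assumes rebrand/DGS has been added to test.dict beforehand.
-- Re-issuing an existing option changes nothing, but forces the
-- dictionary configuration files to be re-read.
ALTER TEXT SEARCH DICTIONARY test_ispell ( StopWords = english );

-- With the reported behavior this yields both lexemes, {rebrand,brand},
-- where only {rebrand} is wanted.
SELECT ts_lexize('test_ispell', 'rebranding') AS sfx_wpfx2;
```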

Thanks,

Brian Carp

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
