Re: Full text: Ispell dictionary

Oleg Bartunov <obartunov@xxxxxxxxx> · Thu, 8 May 2014 00:00:00 +0400

btw, take a look at contrib/dict_xsyn; it's more powerful than the
synonym dictionary.
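
A minimal sketch of trying it out, assuming the contrib module is available on your server and that you have created a rules file my_rules.rules (a hypothetical name) in $SHAREDIR/tsearch_data containing a line such as "smiling smile":

```sql
-- Install the contrib module; this provides the xsyn_template template
-- and a ready-made dictionary named xsyn.
CREATE EXTENSION dict_xsyn;

-- Point the dictionary at the rules file (my_rules is a hypothetical name).
-- KEEPORIG = false asks the dictionary not to emit the original word.
ALTER TEXT SEARCH DICTIONARY xsyn (RULES = 'my_rules', KEEPORIG = false);

-- Inspect what the dictionary emits for a single token.
SELECT ts_lexize('xsyn', 'smiling');
```

ts_lexize can be pointed at any other dictionary the same way (for example timusan_ispell from the original question), which makes it easy to compare what each one emits before wiring it into a configuration.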

On Sat, May 3, 2014 at 2:26 AM, Tim van der Linden <tim@xxxxxxxxx> wrote:
> Hi Oleg
>
> Haha, understood!
>
> Thanks for helping me on this one.
>
> Cheers
> Tim
>
>
> On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov <obartunov@xxxxxxxxx>
> wrote:
>>
>> Tim,
>>
>> you answered your own question - don't use ispell :)
>>
>> On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden <tim@xxxxxxxxx> wrote:
>>>
>>>  On Fri, 2 May 2014 21:12:56 +0400
>>>  Oleg Bartunov <obartunov@xxxxxxxxx> wrote:
>>>
>>>  Hi Oleg
>>>
>>>  Thanks for the response!
>>>
>>>>  Yes, that's normal for an ispell dictionary; think of it as a
>>>> morphological dictionary.
>>>
>>>
>>>  Hmm, I see, that makes sense. I thought the morphological aspect of
>>> Ispell only dealt with splitting up compound words, but it also deals with
>>> reducing the word to a more "stem"-like form, correct?
>>>
>>>  As a last question on this, is there a way to stop this dictionary from
>>> emitting multiple lexemes?
>>>
>>>
>>> The reason I am asking is that, in my (fairly new) understanding of
>>> PostgreSQL's full text search, it is always best to have as few lexemes as
>>> possible saved in the vector, to get smaller indexes and faster matching
>>> afterwards. Also, if you run a tsquery afterwards too, you can still employ
>>> the power of these multiple lexemes to find a match.
>>>
>>>  Or...probably answering my own question...if I do not desire this
>>> behavior I should maybe not use Ispell and simply use another dictionary :)
>>>
>>>  Thanks again.
>>>
>>>  Cheers,
>>>  Tim
>>>
>>>>  On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden <tim@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>  Good morning/afternoon all
>>>>>
>>>>>  I am currently writing a few articles about PostgreSQL's full text
>>>>> capabilities and have a question about the Ispell dictionary which I
>>>>> cannot seem to find an answer to. It is probably a very simple issue, so
>>>>> forgive my ignorance.
>>>>>
>>>>>  In one article I am explaining dictionaries and I have set up a
>>>>> sample configuration which maps most token categories to use only an Ispell
>>>>> dictionary (timusan_ispell) which has a default configuration:
>>>>>
>>>>>  CREATE TEXT SEARCH DICTIONARY timusan_ispell (
>>>>>          TEMPLATE = ispell,
>>>>>          DictFile = en_us,
>>>>>          AffFile = en_us,
>>>>>          StopWords = english
>>>>>  );
>>>>>
>>>>>  When I run a simple query like "SELECT
>>>>> to_tsvector('timusan-ispell','smiling')" I get back the following tsvector:
>>>>>
>>>>>  'smile':1 'smiling':1
>>>>>
>>>>>  As you can see I get two lexemes with the same pointer.
>>>>>  The question here is: why does this happen?
>>>>>
>>>>>  Is it normal behavior for the Ispell dictionary to emit multiple
>>>>> lexemes for a single token? And if so, is this efficient? I
>>>>> mean, why could it not simply save one lexeme 'smile' which (same as
>>>>> the snowball dictionary) would match 'smiling' as well if later matched with
>>>>> the accompanying tsquery?
>>>>>
>>>>>  Thanks!
>>>>>
>>>>>  Cheers,
>>>>>  Tim
>>>>>
>>>>>
>>>>>  --
>>>>>  Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
>>>>>  To make changes to your subscription:
>>>>>  http://www.postgresql.org/mailpref/pgsql-general
>>>
>>>
>>>
>>>  --
>>>  Tim van der Linden <tim@xxxxxxxxx>