Re: Full text: Ispell dictionary

Tim van der Linden <tim@xxxxxxxxx> · Sat, 10 May 2014 10:31:02 +0900

Hi Oleg

> btw, take a look at contrib/dict_xsyn, it's more powerful than
> the synonym dictionary.

Sorry for the late reply...and thank you for the tip.

I will check out xsyn soon. I am about to finish the third and final chapter of my full text series, but I could maybe write an "appendix" chapter which mentions xsyn...or just update my posts.
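
For reference, a minimal sketch of trying out dict_xsyn, assuming the contrib module is installed on the server and that a hypothetical rules file my_rules.rules exists in $SHAREDIR/tsearch_data (the file name and the token 'word' are placeholders, not something from this thread):

```sql
-- Assumes the dict_xsyn contrib module is available on the server.
-- Installing the extension creates a dictionary named xsyn.
CREATE EXTENSION dict_xsyn;

-- 'my_rules' is a hypothetical rules file (my_rules.rules) in
-- $SHAREDIR/tsearch_data; each line holds a word followed by its synonyms.
ALTER TEXT SEARCH DICTIONARY xsyn (RULES = 'my_rules', KEEPORIG = false);

-- Inspect what the dictionary emits for a single token.
SELECT ts_lexize('xsyn', 'word');
```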

Cheers,
Tim

> On Sat, May 3, 2014 at 2:26 AM, Tim van der Linden <tim@xxxxxxxxx> wrote:
> > Hi Oleg
> >
> > Haha, understood!
> >
> > Thanks for helping me on this one.
> >
> > Cheers
> > Tim
> >
> >
> > On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov <obartunov@xxxxxxxxx>
> > wrote:
> >>
> >> Tim,
> >>
> >> you did answer yourself - don't use ispell :)
> >>
> >> On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden <tim@xxxxxxxxx> wrote:
> >>>
> >>>  On Fri, 2 May 2014 21:12:56 +0400
> >>>  Oleg Bartunov <obartunov@xxxxxxxxx> wrote:
> >>>
> >>>  Hi Oleg
> >>>
> >>>  Thanks for the response!
> >>>
> >>>>  Yes, it's normal for ispell dictionary, think about morphological
> >>>> dictionary.
> >>>
> >>>
> >>>  Hmm, I see, that makes sense. I thought the morphological aspect of
> >>> Ispell only dealt with splitting up compound words, but it also
> >>> reduces a word to a more "stem"-like form, correct?
> >>>
> >>>  As a last question on this, is there a way to stop this dictionary from
> >>> emitting multiple lexemes?
> >>>
> >>>
> >>> The reason I am asking is that, in my (fairly new) understanding of
> >>> PostgreSQL's full text search, it is always best to have as few lexemes
> >>> as possible saved in the vector. This gives smaller indexes and faster
> >>> matching afterwards. Also, if you run a tsquery afterwards, you can still
> >>> employ the power of these multiple lexemes to find a match.
> >>>
> >>>  Or...probably answering my own question...if I do not desire this
> >>> behavior I should maybe not use Ispell and simply use another dictionary :)
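> >>>
> >>>  A minimal sketch of that alternative, assuming the built-in english_stem
> >>> Snowball dictionary is acceptable; the configuration name timusan_stem
> >>> is hypothetical, and the list of token types is only an example:
> >>>
> >>>  ```sql
> >>>  -- 'timusan_stem' is a hypothetical configuration name.
> >>>  CREATE TEXT SEARCH CONFIGURATION timusan_stem (COPY = simple);
> >>>
> >>>  -- Map the word-like token types to the Snowball stemmer only.
> >>>  ALTER TEXT SEARCH CONFIGURATION timusan_stem
> >>>      ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >>>                        word, hword, hword_part
> >>>      WITH english_stem;
> >>>
> >>>  -- Check which lexemes end up in the vector.
> >>>  SELECT to_tsvector('timusan_stem', 'smiling');
> >>>  ```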
> >>>
> >>>  Thanks again.
> >>>
> >>>  Cheers,
> >>>  Tim
> >>>
> >>>>  On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden <tim@xxxxxxxxx>
> >>>> wrote:
> >>>>>
> >>>>>  Good morning/afternoon all
> >>>>>
> >>>>>  I am currently writing a few articles about PostgreSQL's full text
> >>>>> capabilities and have a question about the Ispell dictionary which I
> >>>>> cannot seem to find an answer to. It is probably a very simple issue, so
> >>>>> forgive my ignorance.
> >>>>>
> >>>>>  In one article I am explaining dictionaries and I have set up a
> >>>>> sample configuration which maps most token categories to only an Ispell
> >>>>> dictionary (timusan_ispell) which has a default configuration:
> >>>>>
> >>>>>  CREATE TEXT SEARCH DICTIONARY timusan_ispell (
> >>>>>          TEMPLATE = ispell,
> >>>>>          DictFile = en_us,
> >>>>>          AffFile = en_us,
> >>>>>          StopWords = english
> >>>>>  );
> >>>>>
> >>>>>  When I run a simple query like "SELECT
> >>>>> to_tsvector('timusan-ispell','smiling')" I get back the following tsvector:
> >>>>>
> >>>>>  'smile':1 'smiling':1
> >>>>>
> >>>>>  As you can see I get two lexemes with the same position.
> >>>>>  The question here is: why does this happen?
> >>>>>
> >>>>>  Is it normal behavior for the Ispell dictionary to emit multiple
> >>>>> lexemes for a single token? And if so, is this efficient? I
> >>>>> mean, why could it not simply save one lexeme 'smile' which (same as
> >>>>> the snowball dictionary) would match 'smiling' as well if later matched with
> >>>>> the accompanying tsquery?
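> >>>>>
> >>>>>  To see what each dictionary emits on its own, ts_lexize can be called
> >>>>> directly on a single token. A minimal sketch, assuming the
> >>>>> timusan_ispell dictionary above and the built-in english_stem Snowball
> >>>>> dictionary:
> >>>>>
> >>>>>  ```sql
> >>>>>  -- Lexemes produced by the Ispell dictionary defined above.
> >>>>>  SELECT ts_lexize('timusan_ispell', 'smiling');
> >>>>>
> >>>>>  -- Lexemes produced by the built-in English Snowball stemmer.
> >>>>>  SELECT ts_lexize('english_stem', 'smiling');
> >>>>>  ```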
> >>>>>
> >>>>>  Thanks!
> >>>>>
> >>>>>  Cheers,
> >>>>>  Tim
> >>>>>
> >>>>>
> >>>>>  --
> >>>>>  Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
> >>>>>  To make changes to your subscription:
> >>>>>  http://www.postgresql.org/mailpref/pgsql-general
> >>>
> >>>
> >>>
> >>>  --
> >>>  Tim van der Linden <tim@xxxxxxxxx>

-- 
Tim van der Linden <tim@xxxxxxxxx>