Phrase searching

Stephen Davies <scldad@xxxxxxxxxx> · Sat, 23 Feb 2008 22:32:38 +1030

As I understand it, BASIS does phrase searching by first parsing the base 
text into "context units" (sentences and/or paragraphs) and then calculating 
the position of each token within those context units.

That is, a token might have position 3 in context unit 4. All of this is 
stored in the index.
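
To make the idea concrete, here is a minimal sketch of such an index in Python. It is not how BASIS (or ts) actually stores things; the sentence splitting rule, the tokenizer and the name build_index are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercased token to a list of (unit, position) pairs."""
    index = defaultdict(list)
    # Assumption: a context unit is a sentence ending in '.', '!' or '?'.
    units = [u for u in re.split(r"[.!?]+", text) if u.strip()]
    for unit_no, unit in enumerate(units, start=1):
        tokens = re.findall(r"\w+", unit.lower())
        for pos, token in enumerate(tokens, start=1):
            index[token].append((unit_no, pos))
    return index


index = build_index("The cat sat. The dog barked at the cat.")
print(index["cat"])  # [(1, 2), (2, 6)]
```

Here "cat" is token 2 of context unit 1 and token 6 of context unit 2, and that pair of numbers is all the later operators need.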

There are then multiple operators: phrase any (any token in the query is 
matched in a context unit), phrase all (all tokens in the query match within 
a context unit), phrase is (all tokens match in order, including any stop 
words), and phrase like (as for phrase is, but with stop words acting only 
as position holders).
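
As a rough illustration, the four operators can be sketched against a positional index like this. The function names, the stop list and the exact matching rules are my assumptions, not BASIS code. In particular, "phrase like" is read here as: a stop word in the query reserves a position but is not itself compared.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "in", "of", "the"}  # hypothetical stop list


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def build_index(units):
    """Return token -> {unit number -> [positions]}, both counted from 1."""
    index = defaultdict(lambda: defaultdict(list))
    for unit_no, unit in enumerate(units, start=1):
        for pos, token in enumerate(tokenize(unit), start=1):
            index[token][unit_no].append(pos)
    return index


def units_with(index, token):
    return set(index.get(token, {}))


def phrase_any(index, query):
    """Units containing at least one query token."""
    return set().union(*(units_with(index, t) for t in tokenize(query)))


def phrase_all(index, query):
    """Units containing every query token, in any order."""
    sets = [units_with(index, t) for t in tokenize(query)]
    return set.intersection(*sets) if sets else set()


def _in_order(index, query, stop_words_are_placeholders):
    tokens = tokenize(query)
    # Tokens that must really match, with their offset inside the phrase.
    required = [(i, t) for i, t in enumerate(tokens)
                if not (stop_words_are_placeholders and t in STOP_WORDS)]
    if not required:
        return set()
    first_offset, first_token = required[0]
    hits = set()
    for unit_no, positions in index.get(first_token, {}).items():
        for pos in positions:
            start = pos - first_offset  # position of the phrase's first slot
            if start >= 1 and all(
                start + i in index.get(t, {}).get(unit_no, [])
                for i, t in required
            ):
                hits.add(unit_no)
    return hits


def phrase_is(index, query):
    """All tokens adjacent and in order, stop words included."""
    return _in_order(index, query, False)


def phrase_like(index, query):
    """As phrase_is, but stop words only hold a position."""
    return _in_order(index, query, True)


index = build_index([
    "The king of Spain arrived",
    "Spain has a king",
    "The king in Spain slept",
])
print(sorted(phrase_all(index, "king spain")))      # [1, 2, 3]
print(sorted(phrase_is(index, "king of spain")))    # [1]
print(sorted(phrase_like(index, "king of spain")))  # [1, 3]
```

Note that the sketch does not check that a placeholder slot at the very end of a phrase still falls inside the context unit; a real implementation would presumably need the unit lengths for that.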

There is also an "includes" operator which supports queries such as:

includes "foo" & "bar" within 3 words

or 

includes "foo" & "bar" within 3 sentences
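
A proximity test of this kind needs nothing beyond the stored (unit, position) pairs. The sketch below is only my guess at the semantics: "within N words" is taken to mean both terms in the same context unit with positions at most N apart, and "within N sentences" to mean unit numbers at most N apart. The names within_words and within_sentences are invented for the example.

```python
import re
from collections import defaultdict


def build_index(units):
    """Map each lowercased token to a list of (unit, position) pairs."""
    index = defaultdict(list)
    for unit_no, unit in enumerate(units, start=1):
        tokens = re.findall(r"\w+", unit.lower())
        for pos, token in enumerate(tokens, start=1):
            index[token].append((unit_no, pos))
    return index


def within_words(index, a, b, n):
    """Units where a and b occur at most n positions apart."""
    return {ua
            for ua, pa in index.get(a, [])
            for ub, pb in index.get(b, [])
            if ua == ub and abs(pa - pb) <= n}


def within_sentences(index, a, b, n):
    """(unit of a, unit of b) pairs whose unit numbers differ by at most n."""
    return {(ua, ub)
            for ua, _ in index.get(a, [])
            for ub, _ in index.get(b, [])
            if abs(ua - ub) <= n}


index = build_index([
    "foo one two bar",
    "foo one two three four bar",
    "nothing here",
    "bar alone",
])
print(sorted(within_words(index, "foo", "bar", 3)))  # [1]
print(sorted(within_words(index, "foo", "bar", 5)))  # [1, 2]
```

With these assumptions the second unit only matches once the window is widened to 5 words, since "foo" and "bar" sit at positions 1 and 6.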

All of these, plus hit highlighting, are supported without reparsing the 
original text (which might be gigabytes), using only the information in the 
index.

Things like thesaurus expansion in queries are handled by adding AND/OR 
constructs.

All operators support wild cards in query terms.

HTH,
Stephen

On Saturday 23 February 2008 21:48, Oleg Bartunov wrote:
> On Sat, 23 Feb 2008, Stephen Davies wrote:
> > As it turns out, all I needed was in the doco but the key element - the
> > first config arg to ts_headline - was not in any of the examples so I
> > missed it.
>
> Aha, the original examples were based on the default configuration,
> but then the concept was changed and the examples were not
> modified.
>
> > Would it be possible for ts_headline to work with the pre-parsed
> > tsvector?
>
> It's impossible; Richard has already explained the reasons to you.
>
> > I see references to future plans for phrase searching in ts. Is there a
> > date for this?
>
> Not yet. The problem is mostly algebraic :) Simple 'exact search' is doable,
> but we need something more, since we support boolean operators,
> pluggable dictionaries (which could produce several lexemes, for example),
> and document structure (lexeme weights). So we need to define a consistent
> algebra for text in order to have predictable results. This is quite a
> complex task that requires a lot of dedicated time, which we don't have.
>
> > Cheers and thanks,
> > Stephen Davies
> >
> > On Friday 22 February 2008 22:54, Oleg Bartunov wrote:
> >> On Fri, 22 Feb 2008, Stephen Davies wrote:
> >>> Hmmmm!
> >>> I think I now understand the ts position better, thank you.
> >>>
> >>> Part of my problem has been that I am used to the functionality of Open
> >>> Text's LCS (aka BASIS) product which handles text differently.
> >>>
> >>> It includes the position (and context) information in the index and
> >>> does "remember" how the text was parsed so does not need to reparse to
> >>> insert hit navigation tags nor need pointers as to how to parse
> >>> queries. (It also supports phrase searching.)
> >>>
> >>> Now that I have a better understanding of ts, I think I will be able to
> >>> make it do at least most of what I hoped for.
> >>
> >> I'm wondering if it was not described in the text search documentation
> >> :)
> >>
> >>> Thank you again for your help with this.
> >>>
> >>> Cheers,
> >>> Stephen Davies
> >>>
> >>> On Friday 22 February 2008 20:45, Richard Huxton wrote:
> >>>> Stephen Davies wrote:
> >>>>> Unfortunately, my link to the box with the test database is down due
> >>>>> to lack of maintenance by our local telco (Telstra) but I think that
> >>>>> I also missed the optional config arg to ts_headline.
> >>>>>
> >>>>> The lack of link also means that I cannot confirm your findings but
> >>>>> your logic looks good.
> >>>>
> >>>> Looks like ALTER DATABASE <dbname> SET
> >>>> default_text_search_config = 'english' is what you need.
> >>>>
> >>>>> It begs the question, however, as to why ts_headline needs to reparse
> >>>>> the raw text.
> >>>>
> >>>> It needs to line up tsvector lexemes with actual characters in the
> >>>> text. The tsvector is missing punctuation, any stopwords (the, it, a)
> >>>> as well as being stemmed (if your dictionary does that).
> >>>>
> >>>> Also, it's looking for a short span of words that provide the best
> >>>> match. That might not be a complete match of course, and is different
> >>>> to how you'd normally look to use a tsvector.
> >>>>
> >>>>> At least in my case, I am using a trigger to parse the combination of
> >>>>> Title and Abstract to a tsvector field in the table row (as
> >>>>> suggested in 12.2.2 and 12.4.3 in the doco) so that the tsvector is
> >>>>> already available to ts_headline.
> >>>>>
> >>>>> If ts_headline had the ability to use that pre-parsed tsvector, my
> >>>>> problem would never have arisen - and the performance of ts_headline
> >>>>> would be improved.
> >>>>
> >>>> Maybe. It would still have to parse the text to some degree though,
> >>>> just to get the original words & punctuation into the headline.
> >>
> >>  	Regards,
> >>  		Oleg
> >> _____________________________________________________________
> >> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> >> Sternberg Astronomical Institute, Moscow University, Russia
> >> Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
> >> phone: +007(495)939-16-83, +007(495)939-23-83
>
>  	Regards,
>  		Oleg
