Re: ts_headline and query with hyphen

On 12/05/2012 04:49 AM, Tom Lane wrote:
daniel <dochtorek@xxxxxxxxx> writes:
I have a question about ts_headline, when the query includes word like
'on-line' - only the 'line' part is highlighted, even though the whole
phrase is indexed too, some details below.

Part of the reason is that "on" is a stop word (at least in the default
english dictionary).  That's why you get

select to_tsquery('play & on-line');
           to_tsquery
----------------------------
   'play' & 'on-lin' & 'line'

and not "'play' & 'on-lin' & 'on' & 'line'".  If you did get the latter
then you'd get a headline result with both parts highlighted, similar to
your "custom-built" case.
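The splitting described above can be inspected directly. The following is a minimal sketch, assuming the stock `english` text search configuration; it only uses the built-in `ts_debug` and `to_tsquery` functions, and the comparison word 'custom-built' is the one mentioned in the thread.

```sql
-- Show how the default parser tokenizes each hyphenated word and which
-- lexemes the english configuration keeps for every token.
-- In ts_debug output an empty lexemes array ({}) marks a stop word.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'on-line');

SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'custom-built');

-- Compare the queries that ts_headline would receive in each case.
SELECT to_tsquery('english', 'on-line')      AS on_line_query,
       to_tsquery('english', 'custom-built') AS custom_built_query;
```

If "on" shows up with an empty lexemes array while both parts of 'custom-built' produce lexemes, that would be consistent with the stop-word explanation.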


I understand the 'on' part, but still, 'on-lin' is passed to ts_headline, so I thought that match would be preferred over 'line' and the compound would be highlighted as a whole.

Additionally, with a specific value of MaxWords I could see a dangling "line" at the start of a headline ("on-" had been cut off), which is kind of troubling, because it's not even an English document. It doesn't seem to happen with queries like 'custom-built': I can't see it being split at either the beginning or the end of a headline.

Just to be clear: a headline with "on-" cut off is otherwise OK (the matched text sits somewhere in the middle, though with only 'line' highlighted); it's just that the word 'on-line' is used multiple times in the doc and one occurrence happened to land at the beginning of a headline. The cut was not affected by the ShortWord setting, so I guess it's a stop word thing again. If that's the case, then IMHO ts_headline should treat a hyphenated word as a single word when creating the headline and not cut it off like that. But maybe it was intended to work like that.
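A rough way to reproduce the cut-off is to sweep MaxWords over a small range and look at where each headline begins. This is only a sketch: the sample text is made up for illustration, the option values are arbitrary (MinWords has to stay below MaxWords), and whether a dangling "line" appears will depend on the actual document.

```sql
-- Sweep MaxWords and print one headline per setting.
-- The document text below is hypothetical, not the one from the report.
SELECT n AS max_words,
       ts_headline(
         'english',
         'Booking is available on-line. You can also play on-line with friends, '
         || 'and every on-line game is free to play.',
         to_tsquery('english', 'play & on-line'),
         format('MaxWords=%s, MinWords=3, ShortWord=3', n)
       ) AS headline
FROM generate_series(4, 12) AS n;
```

Repeating the sweep with a different ShortWord value would show whether that setting has any effect on where the fragment starts.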

But maybe ts_headline understands or operates on
single, not hyphenated words only?

Dunno.  It would seem reasonable to highlight the whole compound in
these cases, but I have no idea how hard that is.


Right, although the latter case is easy to fix outside postgres and still looks fine; I've included it just as an example. The former causes a few problems in specific cases, and I have to fix them manually now, word by word.

Another thing that seems a bit odd here is that we seem to be stemming
the compound word as a whole, but not the individual parts.  Not sure
how sane that combination of choices is ...
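One way to check whether the parts are really left unstemmed, or merely happen to stem to themselves, is to call the stemmer on each piece separately. A small sketch, assuming the default `english` configuration routes these token types to the `english_stem` dictionary; the extra word 'lines' is added here only for comparison.

```sql
-- Ask the Snowball dictionary directly what it does with the compound
-- and with each part. An empty array means the word is a stop word.
SELECT ts_lexize('english_stem', 'on-line') AS whole_compound,
       ts_lexize('english_stem', 'on')      AS first_part,
       ts_lexize('english_stem', 'line')    AS second_part,
       ts_lexize('english_stem', 'lines')   AS plural_part;  -- comparison word, not from the thread
```

Comparing these results with the `to_tsquery` output quoted earlier should show which lexemes come from the compound and which from its parts.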


Good question, hope others will jump in.

thanks,
daniel



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

