Re: [EXT] Re: Looking for tips on improving full-text search quality in Postgres

"Bayer, Samuel" <sam@xxxxxxxxx> · Fri, 4 Mar 2022 11:39:39 -0500

I've tried both ranking functions (ts_rank and ts_rank_cd) with a variety of the normalization settings. I'm using the standard English language configuration on Postgres 13.

I do understand your FTS philosophy. I'm looking for guidance about how best to approximate the search capability in Solr using the FTS pieces you have. One concrete question: the classic TF/IDF search strategy relies on inverse document frequency, which is computed across the whole corpus. I can't tell whether that corpus-wide frequency information is taken into account in either ranking function.
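
To make the question concrete: as far as the documentation indicates, the built-in ranking functions look only at the tsvector being ranked and use no corpus-wide statistics. The following is a minimal sketch of how per-term document frequencies could be collected with ts_stat() and turned into a plain IDF value. The table docs and its tsvector column tsv are hypothetical names, and the ln(N / ndoc) formula is the textbook IDF, not anything ts_rank() or ts_rank_cd() computes.

```sql
-- Hypothetical table: docs(id, tsv tsvector).
-- ts_stat() scans every row returned by the inner query, so this is
-- meant for occasional analysis, not for use on every search.
WITH stats AS (
    SELECT word, ndoc            -- ndoc = number of documents containing the lexeme
    FROM ts_stat('SELECT tsv FROM docs')
),
total AS (
    SELECT count(*) AS n FROM docs
)
SELECT s.word,
       s.ndoc,
       ln(t.n::float8 / s.ndoc) AS idf   -- textbook IDF: ln(N / document frequency)
FROM stats s
CROSS JOIN total t
ORDER BY idf DESC;
```

The result could be stored in a side table and combined with the built-in rank in a custom scoring expression, at the cost of keeping it up to date as the corpus changes.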

I don't know if Solr weights earlier tokens more heavily, but I wouldn't be surprised if it does.

On 3/4/22 11:09 AM, Tom Lane wrote:
Bruce Momjian <bruce@xxxxxxxxxx> writes:
On Fri, Mar 4, 2022 at 10:41:16AM -0500, Bayer, Samuel wrote:
I apologize for not being able to be more specific.

I know it is hard to quantify.  Is it possible that Postgres is treating
all the terms equally, while Solr is prioritizing terms that are earlier
in the document?

A few basic questions:

* which ranking function are you using?

* with what options?

* which PG version exactly?

As far as I can see from a quick look at the docs, neither
ts_rank() nor ts_rank_cd() considers "earlier in the document"
to be an interesting factor.  They do have the ability
to prefer terms that have been marked as having a higher weight,
but you'd need to do some setup work to make that useful ---
basically, you'd have to separate out the title or other metadata
and apply setweight() to it while building the tsvectors.
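
The setup work described above can be sketched roughly as follows. This is an illustration under assumptions, not a tuned configuration: the table docs, its columns, the index name, and the example search string are all hypothetical, and the weights array simply spells out the documented defaults.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE TABLE docs (
    id    bigserial PRIMARY KEY,
    title text,
    body  text,
    tsv   tsvector
);

-- Build the tsvector with title lexemes labeled 'A' and body lexemes 'D'.
UPDATE docs
SET tsv = setweight(to_tsvector('english', coalesce(title, '')), 'A')
       || setweight(to_tsvector('english', coalesce(body,  '')), 'D');

CREATE INDEX docs_tsv_idx ON docs USING gin (tsv);

-- Rank matches. The weights array is ordered {D, C, B, A}, so a hit in
-- the title counts ten times as much as a hit in the body here.
SELECT id, title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', tsv, query) AS rank
FROM docs, websearch_to_tsquery('english', 'full text search') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```

Adjusting the four weights, or labeling other metadata 'B' or 'C', is the main tuning knob this approach offers.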

I wouldn't be surprised if Solr has some well-tuned default
heuristics that mean that you don't have to work hard to get
good results from it.  The current state of our FTS features
is more like "here's all the parts, but you have to build the
behavior you want".

ISTM that our FTS features have basically been on autopilot
since they went in.  I'd sort of hoped that we'd see more
parsers, more ranking functions, etc, over time ... but nothing
like that has happened.  I'm not sure if that's just lack of
interest, or if people find the code too difficult to work with.

			regards, tom lane