Re: tsvector limitations - why and how

Stanislav Kozlovski <Stanislav_Kozlovski@xxxxxxxxxxx> · Thu, 29 Aug 2024 13:53:35 +0000

Thanks for the reply, Tom.

Makes sense to me.

One thing worth adding: a major misunderstanding of mine, which was pointed out to me, is that the position limit is counted per word, not per character. That makes sense given that the parser works word by word, but I completely missed it. It changes my calculations entirely, from:

> If I want to store a whole book's content - like PostgreSQL: Up and Running (2012) - I'd need to store it over 30 rows. (it's 300 pages long, 300-page books average about 82500 words, English words average about 6.5-4 characters, meaning a tsvector will hold the positions of no more than [2520-3277] words).

to storing about 8250 words per chapter, meaning just 10 rows, since each chapter would fit cleanly into one tsvector (assuming the per-lexeme position limit isn't hit).
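
A rough sketch of that arithmetic, for anyone who wants to check it. The word count and the 6.5-character average are the estimates quoted above; the chapter count of 10 is an assumption implied by 82500 / 8250, and the sketch assumes every word consumes exactly one position.

```python
import math

MAX_POSITION = 16383   # highest lexeme position a tsvector records
BOOK_WORDS = 82_500    # rough word count for a 300-page book (estimate from this thread)
CHAPTERS = 10          # assumption: implied by 82500 / 8250 words per chapter
AVG_CHARS = 6.5        # upper average word length from the earlier estimate


def rows_needed(total_words, words_per_row):
    """Number of rows needed if each row holds at most words_per_row words."""
    return math.ceil(total_words / words_per_row)


# Mistaken reading: the limit counts characters, so a row holds fewer words.
words_per_row_old = int(MAX_POSITION / AVG_CHARS)
rows_old = rows_needed(BOOK_WORDS, words_per_row_old)

# Corrected reading: the limit counts words.
words_per_chapter = BOOK_WORDS // CHAPTERS
chapter_fits = words_per_chapter <= MAX_POSITION
rows_min = rows_needed(BOOK_WORDS, MAX_POSITION)

print("per-character reading:", words_per_row_old, "words/row,", rows_old, "rows")
print("per-word reading:", words_per_chapter, "words/chapter, fits:", chapter_fits)
print("fewest rows if packed to the limit:", rows_min)
```

One row per chapter stays well under the limit, and packing rows right up to the limit would need even fewer rows, though splitting on chapter boundaries is probably the more practical choice.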

So the limitations aren't as egregious as I first thought, which also helps explain why there hasn't been much interest from others in expanding the limits.
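
For reference, here is a toy model of the clamping that bit me in the first place. It is not PostgreSQL code: the function names are made up for illustration, it treats every word as its own lexeme, and it ignores the separate cap on positions per lexeme. It only shows why a two-word phrase stops matching once both words land past position 16383.

```python
MAX_POSITION = 16383  # positions above this are stored as 16383


def build_positions(words):
    """Map each word to its 1-based positions, clamped at MAX_POSITION."""
    index = {}
    for pos, word in enumerate(words, start=1):
        index.setdefault(word, []).append(min(pos, MAX_POSITION))
    return index


def followed_by(index, first, second):
    """True if `second` is recorded exactly one position after `first`."""
    second_positions = set(index.get(second, []))
    return any(p + 1 in second_positions for p in index.get(first, []))


filler = ["filler"] * 20000

# Phrase at the start: positions 1 and 2 are recorded accurately.
early = build_positions(["quick", "fox"] + filler)

# Phrase at the end: positions 20001 and 20002 both collapse to 16383.
late = build_positions(filler + ["quick", "fox"])

print(followed_by(early, "quick", "fox"))
print(followed_by(late, "quick", "fox"))
print(late["quick"], late["fox"])
```

In the second document both words end up recorded at the same position, so the "one position apart" check can never succeed, which matches what I saw when searching for a phrase at the end of the book.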

From: Tom Lane <tgl@xxxxxxxxxxxxx>
Sent: Wednesday, 28 August 2024 0:24
To: Stanislav Kozlovski <Stanislav_Kozlovski@xxxxxxxxxxx>
Cc: pgsql-general@xxxxxxxxxxxxxxxxxxxx <pgsql-general@xxxxxxxxxxxxxxxxxxxx>
Subject: Re: tsvector limitations - why and how

Stanislav Kozlovski <Stanislav_Kozlovski@xxxxxxxxxxx> writes:

> I was aware of the limitations of FTS (https://www.postgresql.org/docs/17/textsearch-limitations.html) and tried to ensure I didn't hit any - but what I missed was that the maximum allowed lexeme position was 16383 and everything above silently gets set to 16383. I was searching for a phrase (two words) at the end of the book and couldn't find it. After debugging I realized that my phrase's lexemes were being set to 16383, which was inaccurate.

> ...

> The problem I had is that it breaks FOLLOWED BY queries, essentially stopping you from being able to match on phrases (more than one word) on large text.

Yeah.  FOLLOWED BY didn't exist when the tsvector storage
representation was designed, so the possible inaccuracy of the
lexeme positions wasn't such a big deal.

> Why is this still the case?

Because nobody's done the significant amount of work needed to make
it better.  I think an acceptable patch would have to support both
the current tsvector representation and a "big" version that's able
to handle anything up to the 1GB varlena limit.  (If you were hoping
for documents bigger than that, you'd be needing a couple more
orders of magnitude worth of work.)  We might also find that there
are performance bottlenecks that'd have to be improved, but even just
making the code cope with two representations would be a big patch.

There has been some cursory talk about this, I think, but I don't
believe anyone's actually worked on it since the 2017 patch you
mentioned.  I'm not sure if that patch is worth using as the basis
for a fresh try: it looks like it had some performance issues, and
AFAICS it didn't really improve the lexeme-position limit.

(Wanders away wondering if the expanded-datum infrastructure could
be exploited here...)

                        regards, tom lane