Hey!
I just spent some time writing up a detailed draft e-mail to send here, reporting what I thought was a bug.
I was playing around with tsvector and full text search for a pet project of mine that ingests books' content. I was aware of the documented limitations of FTS and tried to make sure I didn't hit any of them - but what I missed was that the maximum allowed lexeme position is 16383, and everything above that silently gets set to 16383. I was searching for a two-word phrase at the end of a book and couldn't find it. After some debugging I realized that the positions of my phrase's lexemes were being clamped to 16383, which made them inaccurate.
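For reference, the clamping is easy to see from psql (the lexemes and positions below are arbitrary examples of mine):

    -- Positions above 16383 are silently set to 16383 on input:
    SELECT 'fox:16383 jumps:20000'::tsvector;
    -- returns: 'fox':16383 'jumps':16383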
I acknowledge that even with this limitation the lexeme itself is still stored and remains searchable. I commend the authors for the elegant approach of keeping things functional beyond the limit while still saving space.
The problem I ran into is that it breaks FOLLOWED BY queries, essentially preventing you from matching phrases (more than one word) in large texts.
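Here is a reproduction of what I was seeing (the filler word and the phrase are arbitrary - any phrase that starts past position 16383 behaves the same way):

    -- The lexemes past position 16383 are still indexed, so a plain
    -- AND query finds them:
    SELECT to_tsvector('simple', repeat('filler ', 20000) || 'red balloon')
           @@ to_tsquery('simple', 'red & balloon');
    -- returns: true

    -- But the equivalent FOLLOWED BY query does not match: both
    -- lexemes were clamped to position 16383, so no pair of positions
    -- exactly 1 apart exists anymore.
    SELECT to_tsvector('simple', repeat('filler ', 20000) || 'red balloon')
           @@ to_tsquery('simple', 'red <-> balloon');
    -- returns: false

    -- Worse, 'red <0> balloon' ("at the same position") now matches,
    -- which is equally inaccurate:
    SELECT to_tsvector('simple', repeat('filler ', 20000) || 'red balloon')
           @@ to_tsquery('simple', 'red <0> balloon');
    -- returns: true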
I have a few questions here:
The earliest I could see this limitation in place was PG 8.3, released 16 years ago (2008). Hardware, and in particular memory and storage, have come a long way since. Do these limits still make sense?
I saw a thread from 6 years ago (2018) that asked a similar question, but it was focused on hitting the 1MB limit of the tsvector. There, a member suggested that you'd probably need a whole new type (e.g. bigtsvector) to make this work. Somebody else shared a project called tsvector2, which was a beefed-up version of the type.
There was also a patch that attempted to remove the 1MB limit, but it stalled and never got merged. I'm not sure whether it would also have lifted the position limit.
My understanding of the 16383 position limit is the following: each lexeme position is stored in 16 bits, of which 2 bits hold the weight label (A/B/C/D) and the remaining 14 bits hold the position itself, capping it at 2^14 - 1 = 16383 (see WEP_GETPOS/WEP_GETWEIGHT in src/include/tsearch/ts_type.h).
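A quick sanity check of that arithmetic:

    -- 14 bits available for the position:
    SELECT (1 << 14) - 1 AS max_lexeme_position;
    -- returns: 16383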
Is there anything else I'm missing? If not, I'd be happy to contribute this piece of text to the docs - particularly to the limitations section, as I believe it would help people reading it.
If I want to store a whole book's content - like PostgreSQL: Up and Running (2012) - I'd need to split it across rows. Since positions count words, a single tsvector can hold accurate positions for no more than 16383 words; the book is about 300 pages long, and 300-page books average about 82500 words, so that works out to roughly 6 rows (82500 / 16383 ≈ 5.04).
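For completeness, this is roughly how I'm splitting the text today (the table and all names here are mine; I chunk well below the limit because to_tsvector's token positions don't map one-to-one onto whitespace-separated words):

    CREATE TABLE book_chunks (
        book_id  int,
        chunk_no int,
        body     text,
        tsv      tsvector
    );

    -- Split the raw text on whitespace, number the words, and group
    -- them into chunks of 10000 words - comfortably below the
    -- 16383-position ceiling even when a "word" tokenizes into
    -- several positions. ($$...$$ is a placeholder for the raw text.)
    INSERT INTO book_chunks (book_id, chunk_no, body, tsv)
    SELECT 1,
           (pos - 1) / 10000,
           string_agg(word, ' ' ORDER BY pos),
           to_tsvector('english', string_agg(word, ' ' ORDER BY pos))
    FROM regexp_split_to_table($$...full book text...$$, '\s+')
         WITH ORDINALITY AS t(word, pos)
    GROUP BY (pos - 1) / 10000;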
Six rows sounds fair, but there are two complications:
Are there any unofficial larger tsvector types? Am I doing anything wrong?
Thanks for reading this. I'd appreciate any response.
Best,
Stanislav