Hello,
I'm using a GIN index for a text column on a big table. I use it to rank
the rows, but I also need to get the term positions for each document of
a subset of documents. I assume these positions are stored in the index,
because doc says positions can be used for cover density ranking and
because to_tsvector function gives them :
select * from to_tsvector('I get lexemes and I get term positions.');
to_tsvector
----------------------------------------
'get':2,6 'lexem':3 'posit':8 'term':7
I can get the term positions with to_tsvector but only by parsing the
result string, is there any more handy way ? Something like :
select * from term_and_positions('I get lexemes and I get term positions.');
term | positions
---------+-----------
'get' | {2,6}
'lexem' | {3}
Then, from the term positions, I need to get the character offset of
these term positions. I assume it is NOT stored in the GIN index. By
character offset I mean the character count from string begining to the
term. For the previous example it would be : 'get' --> {2,20}.
I thought about using ts_headline to return the whole text with term
tagged and then parse it to compute the character offsets from the tags.
But this function is very slow, seems like it does not use the GIN index
at all. And I suppose it can't because there is no way to know from a
term position where its substring is in the text.
Now I think the only solution is to make my own C function parsing the
text like to_tsvector does and counting terms AND characters read from
the begining of the text to match them. I got a look on the code, and it
does not seems easy to do because characters offset or string length are
never used by the parsetext function (ts_parse.c). If you have any other
suggestion, would love to hear that !
Regards, Yoann Moreau
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general