GIN : Working with term positions

Yoann Moreau <yoann.moreau@xxxxxxxxxxxxxxx> · Fri, 21 Oct 2011 12:23:40 +0200

Hello,
I'm using a GIN index for a text column on a big table. I use it to rank 
the rows, but I also need to get the term positions for each document of 
a subset of documents. I assume these positions are stored in the index, 
because doc says positions can be used for cover density ranking and 
because to_tsvector function gives them :

select * from to_tsvector('I get lexemes and I get term positions.');
            to_tsvector
----------------------------------------
 'get':2,6 'lexem':3 'posit':8 'term':7

I can get the term positions with to_tsvector but only by parsing the 
result string, is there any more handy way ? Something like :
select * from term_and_positions('I get lexemes and I get term positions.');
 term    | positions
---------+-----------
   'get' |     {2,6}
 'lexem' |       {3}

Then, from the term positions, I need to get the character offset of 
these term positions. I assume it is NOT stored in the GIN index. By 
character offset I mean the character count from string begining to the 
term. For the previous example it would be : 'get' --> {2,20}.

I thought about using ts_headline to return the whole text with term 
tagged and then parse it to compute the character offsets from the tags. 
But this function is very slow, seems like it does not use the GIN index 
at all. And I suppose it can't because there is no way to know from a 
term position where its substring is in the text.

Now I think the only solution is to make my own C function parsing the 
text like to_tsvector does and counting terms AND characters read from 
the begining of the text to match them. I got a look on the code, and it 
does not seems easy to do because characters offset or string length are 
never used by the parsetext function (ts_parse.c). If you have any other 
suggestion, would love to hear that !

Regards, Yoann Moreau

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general