Re: GIN : Working with term positions

Yoann Moreau <yoann.moreau@xxxxxxxxxxxxxxx> · Wed, 26 Oct 2011 15:38:19 +0200

On 21/10/11 12:23, Yoann Moreau wrote:
Hello,
I'm using a GIN index for a text column on a big table. I use it to 
rank the rows, but I also need to get the term positions for each 
document of a subset of documents. I assume these positions are stored 
in the index, because doc says positions can be used for cover density 
ranking and because to_tsvector function gives them :

select * from to_tsvector('I get lexemes and I get term positions.');
            to_tsvector
----------------------------------------
 'get':2,6 'lexem':3 'posit':8 'term':7

I can get the term positions with to_tsvector but only by parsing the 
result string, is there any more handy way ? Something like :
select * from term_and_positions('I get lexemes and I get term 
positions.');
 term    | positions
---------+-----------
   'get' |     {2,6}
 'lexem' |       {3}

Then, from the term positions, I need to get the character offset of 
these term positions. I assume it is NOT stored in the GIN index. By 
character offset I mean the character count from string begining to 
the term. For the previous example it would be : 'get' --> {2,20}.

I thought about using ts_headline to return the whole text with term 
tagged and then parse it to compute the character offsets from the 
tags. But this function is very slow, seems like it does not use the 
GIN index at all. And I suppose it can't because there is no way to 
know from a term position where its substring is in the text.

Now I think the only solution is to make my own C function parsing the 
text like to_tsvector does and counting terms AND characters read from 
the begining of the text to match them. I got a look on the code, and 
it does not seems easy to do because characters offset or string 
length are never used by the parsetext function (ts_parse.c). If you 
have any other suggestion, would love to hear that !

Regards, Yoann Moreau

Hello again, I'm sorry my need is actually a bit different than what I 
have asked. I need to get the term positions using the GIN index, when I 
query my text column, i.e. for a given term.

For example for 2 rows of a 'docs' table with a text column 'text' :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
  id_doc | positions
---------+-----------
       1 |     {2,6}
       2 |       {3}

I know it can't be as simple as this, because the query would first need 
to be filtered with a WHERE using a tsquery and this can't be done in 
the function called like in my example. I suppose such a feature does 
not exist, but is there any way to get the positions of the matching 
terms when querying a GIN index ?

The only possible way I imagine right now is to firstly filter the rows 
with "to_tsvector(text) @@ to_tsquery('get')" and then call 
"to_tsvector(text)" for the n highest ranked rows, parsing the string 
returned by the function to find the term and its positions. But would 
be way more efficient to get them directly at the first call when 
matching the terms with @@ operator. I know it would be impossible if 
the query contain more than 1 term because it can't return 2 arrays of 
position in one row (i.e. for one document), but for now I'm trying to 
do this for 1 query term.
Any help or advice would be welcome !

By the way, I have done the C function computing the character offset of 
a given term position for a text column. It's not done in a good way, 
but it's more a topic for hackers postgresql list.

Regards, Yoann Moreau

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general