On 21/10/11 12:23, Yoann Moreau wrote:
Hello,
I'm using a GIN index for a text column on a big table. I use it to
rank the rows, but I also need to get the term positions for each
document of a subset of documents. I assume these positions are stored
in the index, because doc says positions can be used for cover density
ranking and because to_tsvector function gives them :
select * from to_tsvector('I get lexemes and I get term positions.');
to_tsvector
----------------------------------------
'get':2,6 'lexem':3 'posit':8 'term':7
I can get the term positions with to_tsvector but only by parsing the
result string, is there any more handy way ? Something like :
select * from term_and_positions('I get lexemes and I get term
positions.');
term | positions
---------+-----------
'get' | {2,6}
'lexem' | {3}
Then, from the term positions, I need to get the character offset of
these term positions. I assume it is NOT stored in the GIN index. By
character offset I mean the character count from string begining to
the term. For the previous example it would be : 'get' --> {2,20}.
I thought about using ts_headline to return the whole text with term
tagged and then parse it to compute the character offsets from the
tags. But this function is very slow, seems like it does not use the
GIN index at all. And I suppose it can't because there is no way to
know from a term position where its substring is in the text.
Now I think the only solution is to make my own C function parsing the
text like to_tsvector does and counting terms AND characters read from
the begining of the text to match them. I got a look on the code, and
it does not seems easy to do because characters offset or string
length are never used by the parsetext function (ts_parse.c). If you
have any other suggestion, would love to hear that !
Regards, Yoann Moreau
Hello again, I'm sorry my need is actually a bit different than what I
have asked. I need to get the term positions using the GIN index, when I
query my text column, i.e. for a given term.
For example for 2 rows of a 'docs' table with a text column 'text' :
'I get lexemes and I get term positions.'
'Did you get the positions ?'
I'd need a function like this :
select term_positions(text, 'get') from docs;
id_doc | positions
---------+-----------
1 | {2,6}
2 | {3}
I know it can't be as simple as this, because the query would first need
to be filtered with a WHERE using a tsquery and this can't be done in
the function called like in my example. I suppose such a feature does
not exist, but is there any way to get the positions of the matching
terms when querying a GIN index ?
The only possible way I imagine right now is to firstly filter the rows
with "to_tsvector(text) @@ to_tsquery('get')" and then call
"to_tsvector(text)" for the n highest ranked rows, parsing the string
returned by the function to find the term and its positions. But would
be way more efficient to get them directly at the first call when
matching the terms with @@ operator. I know it would be impossible if
the query contain more than 1 term because it can't return 2 arrays of
position in one row (i.e. for one document), but for now I'm trying to
do this for 1 query term.
Any help or advice would be welcome !
By the way, I have done the C function computing the character offset of
a given term position for a text column. It's not done in a good way,
but it's more a topic for hackers postgresql list.
Regards, Yoann Moreau
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general