Search Postgresql Archives

Re: GIN : Working with term positions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 21/10/11 12:23, Yoann Moreau wrote:
Hello,
I'm using a GIN index for a text column on a big table. I use it to rank the rows, but I also need to get the term positions for each document of a subset of documents. I assume these positions are stored in the index, because doc says positions can be used for cover density ranking and because to_tsvector function gives them :

select * from to_tsvector('I get lexemes and I get term positions.');
            to_tsvector
----------------------------------------
 'get':2,6 'lexem':3 'posit':8 'term':7

I can get the term positions with to_tsvector but only by parsing the result string, is there any more handy way ? Something like : select * from term_and_positions('I get lexemes and I get term positions.');
 term    | positions
---------+-----------
   'get' |     {2,6}
 'lexem' |       {3}


Then, from the term positions, I need to get the character offset of these term positions. I assume it is NOT stored in the GIN index. By character offset I mean the character count from string begining to the term. For the previous example it would be : 'get' --> {2,20}.

I thought about using ts_headline to return the whole text with term tagged and then parse it to compute the character offsets from the tags. But this function is very slow, seems like it does not use the GIN index at all. And I suppose it can't because there is no way to know from a term position where its substring is in the text.

Now I think the only solution is to make my own C function parsing the text like to_tsvector does and counting terms AND characters read from the begining of the text to match them. I got a look on the code, and it does not seems easy to do because characters offset or string length are never used by the parsetext function (ts_parse.c). If you have any other suggestion, would love to hear that !

Regards, Yoann Moreau

Hello again, I'm sorry my need is actually a bit different than what I have asked. I need to get the term positions using the GIN index, when I query my text column, i.e. for a given term.

For example for 2 rows of a 'docs' table with a text column 'text' :
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this :
select term_positions(text, 'get') from docs;
  id_doc | positions
---------+-----------
       1 |     {2,6}
       2 |       {3}

I know it can't be as simple as this, because the query would first need to be filtered with a WHERE using a tsquery and this can't be done in the function called like in my example. I suppose such a feature does not exist, but is there any way to get the positions of the matching terms when querying a GIN index ?

The only possible way I imagine right now is to firstly filter the rows with "to_tsvector(text) @@ to_tsquery('get')" and then call "to_tsvector(text)" for the n highest ranked rows, parsing the string returned by the function to find the term and its positions. But would be way more efficient to get them directly at the first call when matching the terms with @@ operator. I know it would be impossible if the query contain more than 1 term because it can't return 2 arrays of position in one row (i.e. for one document), but for now I'm trying to do this for 1 query term.
Any help or advice would be welcome !

By the way, I have done the C function computing the character offset of a given term position for a text column. It's not done in a good way, but it's more a topic for hackers postgresql list.

Regards, Yoann Moreau

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux