2011/10/26 Yoann Moreau <yoann.moreau@xxxxxxxxxxxxxxx>: > On 21/10/11 12:23, Yoann Moreau wrote: >> >> Hello, >> I'm using a GIN index for a text column on a big table. I use it to rank >> the rows, but I also need to get the term positions for each document of a >> subset of documents. I assume these positions are stored in the index, >> because doc says positions can be used for cover density ranking and because >> to_tsvector function gives them : >> >> select * from to_tsvector('I get lexemes and I get term positions.'); >> to_tsvector >> ---------------------------------------- >> 'get':2,6 'lexem':3 'posit':8 'term':7 >> >> I can get the term positions with to_tsvector but only by parsing the >> result string, is there any more handy way ? Something like : >> select * from term_and_positions('I get lexemes and I get term >> positions.'); >> term | positions >> ---------+----------- >> 'get' | {2,6} >> 'lexem' | {3} >> >> >> Then, from the term positions, I need to get the character offset of these >> term positions. I assume it is NOT stored in the GIN index. By character >> offset I mean the character count from string begining to the term. For the >> previous example it would be : 'get' --> {2,20}. >> >> I thought about using ts_headline to return the whole text with term >> tagged and then parse it to compute the character offsets from the tags. But >> this function is very slow, seems like it does not use the GIN index at all. >> And I suppose it can't because there is no way to know from a term position >> where its substring is in the text. >> >> Now I think the only solution is to make my own C function parsing the >> text like to_tsvector does and counting terms AND characters read from the >> begining of the text to match them. I got a look on the code, and it does >> not seems easy to do because characters offset or string length are never >> used by the parsetext function (ts_parse.c). If you have any other >> suggestion, would love to hear that ! >> >> Regards, Yoann Moreau > > Hello again, I'm sorry my need is actually a bit different than what I have > asked. I need to get the term positions using the GIN index, when I query my > text column, i.e. for a given term. > > For example for 2 rows of a 'docs' table with a text column 'text' : > 'I get lexemes and I get term positions.' > 'Did you get the positions ?' > > I'd need a function like this : > select term_positions(text, 'get') from docs; > id_doc | positions > ---------+----------- > 1 | {2,6} > 2 | {3} > > I know it can't be as simple as this, because the query would first need to > be filtered with a WHERE using a tsquery and this can't be done in the > function called like in my example. I suppose such a feature does not exist, > but is there any way to get the positions of the matching terms when > querying a GIN index ? > > The only possible way I imagine right now is to firstly filter the rows with > "to_tsvector(text) @@ to_tsquery('get')" and then call "to_tsvector(text)" > for the n highest ranked rows, parsing the string returned by the function > to find the term and its positions. But would be way more efficient to get > them directly at the first call when matching the terms with @@ operator. I > know it would be impossible if the query contain more than 1 term because it > can't return 2 arrays of position in one row (i.e. for one document), but > for now I'm trying to do this for 1 query term. > Any help or advice would be welcome ! > > By the way, I have done the C function computing the character offset of a > given term position for a text column. It's not done in a good way, but it's > more a topic for hackers postgresql list. > Don't forget when you success that word positions are affected by the word removed by stop-words. -- Cédric Villemain +33 (0)6 20 30 22 52 http://2ndQuadrant.fr/ PostgreSQL: Support 24x7 - Développement, Expertise et Formation -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general