Re: ranking how "similar" are tsvectors was: OR tsquery

Oleg Bartunov <oleg@xxxxxxxxxx> · Sun, 17 Jan 2010 20:19:59 +0300 (MSK)

Ivan,

You can write a function to get the lexemes from a tsvector:

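CREATE OR REPLACE FUNCTION ts_stat(tsvector, weights text,
                                   OUT word text, OUT ndoc integer, OUT nentry integer)
RETURNS SETOF record AS
$$
    -- Wrap the tsvector in a one-row query and hand it to the built-in
    -- ts_stat(sqlquery text, weights text). Only the embedded tsvector
    -- literal needs quoting; the weights string is passed through as is.
    SELECT s.word, s.ndoc, s.nentry
    FROM ts_stat('SELECT ' || quote_literal($1::text) || '::tsvector', $2) AS s;
$$ LANGUAGE SQL RETURNS NULL ON NULL INPUT IMMUTABLE;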

Then you can build an array of lexemes like this:

select ARRAY ( select (ts_stat(fts,'*')).word from papers where id=2);

Do this for each of the two documents and you will have two arrays, to
which you can apply any similarity function (cosine, Jaccard, ...) to
calculate what you want. If you want to preserve weights, use a weight
label (A, B, C or D) instead of '*'.
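
As an illustration of the "any similarity function" step, here is a
minimal sketch of Jaccard similarity over two lexeme arrays. The name
lexeme_jaccard is hypothetical (it is not part of PostgreSQL), and the
sketch assumes PostgreSQL 8.4 or later for unnest(). It ignores weights
and positions.

```sql
-- Hypothetical helper, not a built-in: Jaccard similarity of two lexeme
-- arrays, i.e. |A intersect B| / |A union B|. Duplicates are collapsed by
-- INTERSECT / UNION, so each array is treated as a set.
CREATE OR REPLACE FUNCTION lexeme_jaccard(text[], text[])
RETURNS float8 AS
$$
    SELECT CASE WHEN u.n = 0 THEN 0::float8   -- both arrays are empty
                ELSE i.n::float8 / u.n
           END
    FROM (SELECT count(*) AS n
          FROM (SELECT unnest($1) INTERSECT SELECT unnest($2)) AS x) AS i,
         (SELECT count(*) AS n
          FROM (SELECT unnest($1) UNION SELECT unnest($2)) AS y) AS u;
$$ LANGUAGE SQL STRICT IMMUTABLE;

-- Two shared lexemes out of four distinct ones, so this returns 0.5:
SELECT lexeme_jaccard(ARRAY['cat','fat','rat'], ARRAY['cat','rat','mat']);

-- With the ts_stat wrapper above; id = 1 is an assumed second document:
SELECT lexeme_jaccard(
    ARRAY(SELECT (ts_stat(fts, '*')).word FROM papers WHERE id = 1),
    ARRAY(SELECT (ts_stat(fts, '*')).word FROM papers WHERE id = 2));
```

Cosine similarity would additionally need a per-lexeme count such as
nentry, which the same wrapper returns.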

Another idea is to use array_agg, but I'm not ready to discuss it.

Please keep in mind that document similarity is a hot topic in IR.
Yes, Teodor and I have something for this, but the code isn't publicly
available. Unfortunately, we had no sponsor for full-text search last
year and I see no prospects for this year, so we are postponing our
text-search development.

Oleg

On Sun, 17 Jan 2010, Ivan Sergio Borgonovo wrote:

My initial request was about a way to build a tsquery similar to
what plainto_tsquery produces, but using | instead of & as the
glue.
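
A minimal sketch of one way to get such a query, assuming no lexeme
itself contains '&': cast the plainto_tsquery result to text, swap the
operator, and cast back. The 'english' configuration and the sample
text are only illustrative.

```sql
-- plainto_tsquery joins the lexemes with &; rewrite the operator to |
-- on the text form, then parse the result back into a tsquery.
-- Caveat: this would also rewrite an '&' occurring inside a lexeme.
SELECT replace(plainto_tsquery('english', 'fat cats and rats')::text,
               '&', '|')::tsquery;
```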

But at the end of the day I'd like to find similar tsvectors and
rank them.

I have a table with several fields that together build up a
weighted tsvector.

I'd like to pick a tsvector and find the N most similar ones.

I've found this:

http://domas.monkus.lt/document-similarity-postgresql

That's not really too far from what I was trying to do.

But I have precomputed tsvectors (I think turning text into a
tsvector is a more expensive operation than string replacement)
and I'd like to preserve the weights.

I'm not really sure but I think a lexeme can actually contain a '
or a space (depending on stemmer/parser?), so I'd have to take care
of escaping etc...

Since there is no direct access to the elements of a tsvector... the
only "correct" way I see to build the query would be to manually
rebuild the tsvector and get the result back as a record using
ts_debug and ts_lexize... that looks like a bit of a PITA.

I don't even think that having direct access to the elements of a
tsvector would completely solve the problem, since tsvectors store
positions too, but it would be a step forward in making it easier to
compare documents and find similar ones.
An operator that checks the intersection of tsvectors would come in
handy.
Adding a ts_rank(tsvector, tsvector) would surely help too.

thanks

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)