This may be a totally bad idea:
Explode your sentences into (sentence_number, one_word) rows, one row per word (this makes a big table; you may want to partition it).
Then put classic indexes on sentence_number and on one_word (a btree if you do = comparisons; something more subtle if you do "LIKE 'word'").
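A minimal sketch of that exploded layout (table and column names are illustrative, not from the original post; it assumes a sentences(id, sentence) table and whitespace tokenization):

```sql
-- Exploded layout: one row per (sentence, word) pair.
CREATE TABLE sentence_words (
    sentence_number integer NOT NULL,
    one_word        text    NOT NULL
);

-- Populate by splitting each sentence on whitespace.
INSERT INTO sentence_words (sentence_number, one_word)
SELECT s.id, w.word
FROM sentences s,
     LATERAL regexp_split_to_table(lower(s.sentence), '\s+') AS w(word);

-- Classic btree indexes for = comparisons.
CREATE INDEX ON sentence_words (sentence_number);
CREATE INDEX ON sentence_words (one_word);
```

Lower-casing (and optionally stemming via a text search dictionary) at load time keeps the = comparisons cheap at query time.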
Depending on performance, it could be worth it to regroup by word:
(sentence_numbers[], one_word)
Then you could try an array or hstore index on sentence_numbers[]?
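The regrouped form could be sketched like this (again, names are illustrative assumptions; it builds on the exploded table above):

```sql
-- One row per distinct word, carrying every sentence that contains it.
CREATE TABLE word_sentences AS
SELECT one_word,
       array_agg(DISTINCT sentence_number) AS sentence_numbers
FROM sentence_words
GROUP BY one_word;

CREATE INDEX ON word_sentences (one_word);

-- Candidate sentences sharing at least one word with the query words:
SELECT DISTINCT unnest(sentence_numbers) AS sentence_number
FROM word_sentences
WHERE one_word IN ('similar', 'words');
```

Ranking the candidates by how many query words they share would then approximate a word-overlap similarity without scanning the full sentence table.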
Cheers,
Rémi-C
2013/12/5 Janek Sendrowski <janek12@xxxxxx>
Hi,
I have tables with millions of sentences, one sentence per row. The sentences are natural language and any language is possible, but all sentences within one table share the same language.
I have to do a similarity search on them. It has to be very fast, because I have to search for a few hundred sentences many times.
The search shouldn't be context-based. It should just find sentences with similar words (maybe stemmed).
I already tried gist/gin-index-based trigram search (the pg_trgm extension), full-text search (the tsearch2 extension), and pivot-based indexing (Fixed Query Array), but they are all too slow or not suitable.
Soundex and Metaphone aren't suitable either.
I have been working on this project for a long time, but without any success.
Do any of you have an idea?
I would be very thankful for help.
Janek Sendrowski
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general