
Re: Fastest Index/Algorithm to find similar sentences

I worked on a library project once that needed to perform similarity searches.

The first step was to build a word dictionary that assigns a number to each word:
1, 'aardvark'
...
99999, 'zygote'

Then you need a list of stop words like 'AND', 'THE':
https://en.wikipedia.org/wiki/Stop_words

Then you write a sentence parser that replaces each word with its number, skipping the stop words.
A bibliography entry (for example) then becomes a vector of numbers.
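
As a rough illustration of the dictionary, stop-word list, and parser described above, here is a minimal Python sketch. The names (STOP_WORDS, tokenize, build_dictionary, sentence_to_vector), the tiny stop-word list, and the tokenizing rule are all assumptions made for the example, not what the original project used.

```python
import re

# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "a", "an", "of"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(sentences):
    """Assign a number (starting at 1) to every non-stop word, in sorted order."""
    words = sorted({w for s in sentences for w in tokenize(s) if w not in STOP_WORDS})
    return {word: number for number, word in enumerate(words, start=1)}

def sentence_to_vector(sentence, dictionary):
    """Replace each known word with its number; stop words and unknown words are skipped."""
    return [dictionary[w] for w in tokenize(sentence) if w in dictionary]

corpus = ["The aardvark and the zygote", "A zygote of an aardvark"]
dictionary = build_dictionary(corpus)
vector = sentence_to_vector("The zygote and the aardvark", dictionary)
```

Because stop words never enter the dictionary, the same membership test filters both stop words and words the dictionary has not seen.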

You can query with things like wordcount, word x NEAR word y, etc.
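
A word count or a NEAR test over such a vector could look roughly like the sketch below. The function names, the max_distance parameter, and the choice to measure distance in vector positions (that is, after stop words were removed) are assumptions for illustration.

```python
def word_count(vector, word_id):
    """Number of times word_id occurs in the entry."""
    return vector.count(word_id)

def near(vector, x, y, max_distance=3):
    """True if some occurrence of x lies within max_distance positions of some y."""
    positions_x = [i for i, w in enumerate(vector) if w == x]
    positions_y = [i for i, w in enumerate(vector) if w == y]
    return any(abs(i - j) <= max_distance for i in positions_x for j in positions_y)

# A hypothetical entry, already converted to word numbers.
entry = [7, 3, 9, 3, 12, 5]
count_of_3 = word_count(entry, 3)
seven_near_nine = near(entry, 7, 9)
seven_near_five = near(entry, 7, 5)
```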
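If the database supports it, you can also query with bitmap indexes.
I have not used them much in PostgreSQL. As far as I know it has no on-disk bitmap index type; instead the planner builds bitmaps in memory from ordinary indexes and combines them (bitmap index scans), which looks like it might be quite useful: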
http://wiki.postgresql.org/wiki/Bitmap_Indexes
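
To show the general bitmap idea (not how PostgreSQL implements it), here is a small in-memory sketch that keeps one bitmap per word number, with bit i set when entry i contains that word. Python integers stand in for the bitmaps; build_bitmaps and entries_with_all are hypothetical names.

```python
def build_bitmaps(entries):
    """Map each word number to an integer whose bit i is set if entry i contains it."""
    bitmaps = {}
    for entry_no, vector in enumerate(entries):
        for word_id in set(vector):
            bitmaps[word_id] = bitmaps.get(word_id, 0) | (1 << entry_no)
    return bitmaps

def entries_with_all(bitmaps, word_ids):
    """Return the entry numbers that contain every word in word_ids."""
    if not word_ids:
        return []
    result = -1  # all bits set, so the first AND just copies the first bitmap
    for word_id in word_ids:
        result &= bitmaps.get(word_id, 0)
    return [i for i in range(result.bit_length()) if (result >> i) & 1]

# Three hypothetical entries as vectors of word numbers.
entries = [[1, 2, 3], [2, 3], [1, 3, 4]]
bitmaps = build_bitmaps(entries)
both_1_and_3 = entries_with_all(bitmaps, [1, 3])
both_2_and_4 = entries_with_all(bitmaps, [2, 4])
```

Combining several word conditions then costs one AND per word, whatever the number of entries a single machine word can cover.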

We used the ALA library parsing rules, which strip off special characters, make capitalization uniform, etc.
http://www.ala.org/tools/guidelines/standardsguidelines
Something like this project was the outcome:
http://www.ala.org/lita/ital/21/4/su
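
The kind of normalization mentioned above (stripping special characters, uniform capitalization) can be sketched as follows. This is only a guess at the general shape and is not the actual ALA rules; the choice of upper case, the removal of accents, and the function name normalize are assumptions.

```python
import re
import unicodedata

def normalize(text):
    """Strip accents and special characters, uppercase, and collapse whitespace."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Make capitalization uniform (upper case is an arbitrary choice here).
    upper = without_marks.upper()
    # Replace every run of characters other than A-Z, 0-9, or space with a space.
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", upper)
    # Collapse repeated spaces and trim the ends.
    return " ".join(cleaned.split())

title = normalize("Méthodes d'analyse: (2nd ed.)")
```

Running text through a step like this before the dictionary lookup means that variants such as "Aardvark," and "aardvark" map to the same word number.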

You might look into library software. Maybe you can find something useful here:
http://www.loc.gov/marc/marctools.html

I see that there are some sourceforge MARC record projects:
http://sourceforge.net/directory/os:windows/freshness:recently-updated/?q=marc%20records




-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




