I once worked on a library project that needed to perform similarity searches.

The first step was to construct a word dictionary in which each word has a corresponding number:

  1, 'aardvark'
  ...
  99999, 'zygote'

Then you need a list of stop words like 'AND' and 'THE':
https://en.wikipedia.org/wiki/Stop_words

Then you write a sentence parser that turns words into their numbers. A bibliography entry (for example) then becomes a vector of numbers, and you can query with things like word count, word x NEAR word y, etc.

If the database supports it, you can also query with bitmap indexes. I have not used the PostgreSQL bitmap indexes much, but they look like they might be quite useful:
http://wiki.postgresql.org/wiki/Bitmap_Indexes

We used something called ALA library parsing rules that stripped off special characters, made capitalization uniform, etc.:
http://www.ala.org/tools/guidelines/standardsguidelines

Something like this project was the outcome:
http://www.ala.org/lita/ital/21/4/su

You might look into library software. Maybe you can find something useful here:
http://www.loc.gov/marc/marctools.html

I see that there are some SourceForge MARC record projects:
http://sourceforge.net/directory/os:windows/freshness:recently-updated/?q=marc%20records
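To make the dictionary / stop word / parser steps concrete, here is a rough Python sketch. It is not the code from that project: the stop-word list, the sample entries, and all function names are made up for illustration, and the normalization is a simple stand-in for the ALA rules (lowercase, drop anything that is not a letter or digit). Note that NEAR here measures distance in positions after stop words have been removed, which is an assumption on my part.

```python
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"and", "the", "of", "a", "in"}

def normalize(text):
    """Lowercase, replace special characters with spaces, split into words."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def build_dictionary(entries):
    """Number every non-stop word, in sorted order, starting at 1."""
    words = sorted({w for e in entries for w in normalize(e)} - STOP_WORDS)
    return {word: number for number, word in enumerate(words, start=1)}

def encode(text, dictionary):
    """Turn a text into a vector of word numbers.

    Stop words and words missing from the dictionary are dropped.
    """
    return [dictionary[w] for w in normalize(text) if w in dictionary]

def word_count(vector, number):
    """How many times a word number occurs in an encoded entry."""
    return vector.count(number)

def near(vector, x, y, distance=3):
    """True if numbers x and y occur within `distance` positions of each other.

    Positions are counted in the encoded vector, i.e. after stop words
    have been removed.
    """
    pos_x = [i for i, n in enumerate(vector) if n == x]
    pos_y = [i for i, n in enumerate(vector) if n == y]
    return any(abs(i - j) <= distance for i in pos_x for j in pos_y)

# Made-up bibliography entries, for illustration only.
entries = [
    "The Aardvark and the Zygote: A Field Guide",
    "Zygote Development in the Aardvark",
]
dictionary = build_dictionary(entries)
vectors = [encode(e, dictionary) for e in entries]
```

With vectors like these stored per entry, a word count or NEAR query becomes a comparison on small integers rather than on strings. In a database you would keep the dictionary in its own table and store the vectors as integer arrays or as one row per (entry, position, word number), whichever your queries favor.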