On 6/14/11 1:42 PM, Tim wrote:
> So I ran this test:

The novel "Hawaii" at 960 pages is roughly 1 MB. tsvector was intended for documents (web pages, news articles, corporate memos, ...), not for books. What you're asking for is interesting, but you can't complain that an open-source project that was designed for a different purpose doesn't meet your needs.

> So how am I to use the PGSQL FTS as a "full text search" when the above
> example can only handle a "small or partial text search"? Maybe a better
> question is, "So how am I to use PGSQL FTS as a 'massively huge text
> search' when it was designed for nothing bigger than 'huge text search'?"
> Any thoughts or alternatives are most welcome.

I'm curious how tsvector could be useful on a 29 MB document. That's roughly one whole encyclopedia set. A document that size should have a huge vocabulary, and tsvector's index would be saturated.

However, if the vocabulary in this 29 MB document isn't that big, then you might consider creating a smaller "document." You could write a Perl script that scans the document, builds a dictionary of the words it contains, and writes that out as a secondary "vocabulary" file listing each unique word once. Create an auxiliary column in your database to hold this vocabulary for each document, and use tsvector to index that instead. The Perl program would be trivial, and tsvector would be happy.

Craig
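P.S. A rough sketch of the kind of Perl script I have in mind. The file name, the crude word-splitting rule, and the one-word-per-line output format are just placeholders; adjust them to your data:

    #!/usr/bin/perl
    # Sketch: boil a large document down to its unique words, so the much
    # smaller "vocabulary" can go into an auxiliary column and be indexed
    # with to_tsvector() instead of the full 29 MB text.
    use strict;
    use warnings;

    my %seen;
    open my $in, '<', 'document.txt' or die "can't open document.txt: $!";
    while (my $line = <$in>) {
        # crude tokenizer: lowercase, split on anything that isn't a letter
        for my $word (split /[^a-z]+/, lc $line) {
            $seen{$word} = 1 if length $word;
        }
    }
    close $in;

    # one word per line; load this into the auxiliary vocabulary column
    print "$_\n" for sort keys %seen;

Because each word appears only once in the output, the tsvector built from the vocabulary column stays small no matter how long the original document is.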