Thanks for writing.
If one were to try to increase the limitation of tsvectors (I'm not sure I need to yet; this thread is mainly to determine that.)
Instead of using a solution involving a "vocabulary" file,
one would probably be better off discarding tsvectors making a vocabulary table then linking it to documents with a (dict_id, hit_count, word_id) table
It would be faster, smaller, and more accurate because it would not contain the now useless position information, while it would contain the otherwise lost word count information.
I wonder if anyone has any incite on the inner workings of ORACLE/MSSQL/etc FTS.
Maybe there is a common design pattern I/we can use to handle text files of non trivial vocabulary that preserves position information.
I'm not sure a (dict_id, position, word_id) table would be queryable in a useful way
because as far as I can think at the moment there is no good way to compare the position in different rows with SQL.
I will collect a few more sample files to see how much I relay need this .... multi-language files are probably the worst offenders.
On Tue, Jun 14, 2011 at 5:12 PM, Craig James <craig_james@xxxxxxxxxxxxxx> wrote:
On 6/14/11 1:42 PM, Tim wrote:The novel "Hawaii" at 960 pages is roughly 1MB. tsvector was intended for documents (web pages, news articles, corporate memos, ...), not for books. What you're asking for is interesting, but you can't complain that an open-source project that was designed for a different purpose doesn't meet your needs.So I ran this test:
unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txtand I got this:
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
mv /tmp/text.* /var/lib/postgresql/9.0/main/
cd ~/;psql -d postgres
#psql (9.0.4)
CREATE DATABASE test;
\q
cd ~/;psql -d test
CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'), TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000)) );
#ERROR: string is too long for tsvector (30990860 bytes, max 1048575 bytes)doing the math
echo "scale=3;29/(30990860/1048575)"|bc #==0.981
Indicates a (worst case) limit of searching only the first 1MB of a text file before you need to start building your own search maybe on top of tsvector.
The year is 2011 I don't think searching a 2MB text file is to much to expect.
Maybe a better question is, "So how am I to use PGSQL FTS as a "massively huge text search" when it was designed for nothing bigger than "huge text search"?
So how am I to use the PGSQL FTS as a "full text search" when the above example can only handle a "small or partial text search"?
If I'm not missing anything maybe the documentation should be adjusted accordingly.
I'm curious how tsvector could be useful on a 29 MB document. That's roughly one whole encyclopedia set. A document that size should have a huge vocabulary, and tsvector's index would be saturated.
Any thoughts or alternatives are most welcome.
However, if the vocabulary in this 29 MB document isn't that big, then you might consider creating a smaller "document." You could write a Perl script that scans the document and creates a dictionary which it writes out as a secondary "vocabulary" file that's a list of the unique words in your document. Create an auxillary column in your database to hold this vocabulary for each document, and use tsvector to index that. The perl program would be trivial, and tsvector would be happy.
Craig