My test was indeed atypical vocabulary; it was a dictionary file.
I was intentionally trying to hit the limit to find out where it was, because the documentation did not directly address it.
I am mainly trying to find out whether this will actually be a limitation for me.
Thank you for contributing the test data on the PostgreSQL docs (text=11MB, tsvector=0.356MB) and the anonymous technical book (text=0.2MB, tsvector=0.1MB).
It seems that as long as a document uses a small percentage of the
language's potential vocabulary, it would be hard to hit the limit.
On the other hand, if someone tries to upload a dictionary, or a
file using more than about 4% of a single vocabulary, it would need
special handling.
(I guess this would be more likely on larger vocabularies like maybe the Japanese writing systems)
The 4% number is probably low due to misspellings, slang, etc.
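
For what it's worth, the arithmetic behind that rough figure (I am
just dividing the lexeme limit from the error message quoted below
by the size of my dictionary text):

-- rough source of the "4%" figure: tsvector limit / dictionary size
SELECT round(100.0 * 1048575 / 30990860, 1) AS pct_of_dictionary;
-- ~3.4%, which I am rounding up to "about 4%"
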
Anyway, I now have an approximate answer to the original question
of where the limit is, and it's probably safe, for my goals, to
just check and warn if a file can't be indexed.
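
Something like this is what I have in mind (just a sketch; I am
assuming the "string is too long for tsvector" error falls under
the program_limit_exceeded condition, otherwise a broader handler
would be needed):

-- Try to build the tsvector; warn and store NULL if the text is
-- too big to index.
CREATE OR REPLACE FUNCTION try_to_tsvector(doc text)
RETURNS tsvector AS $$
BEGIN
    RETURN to_tsvector(doc);
EXCEPTION
    WHEN program_limit_exceeded THEN
        RAISE WARNING 'document not indexed: %', SQLERRM;
        RETURN NULL;
END;
$$ LANGUAGE plpgsql;

-- used the same way as the INSERT in the test quoted below
INSERT INTO test VALUES ('text.docx', lo_import('text.docx'),
    try_to_tsvector(pg_read_file('text.txt', 0, 100000000)));
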
On Tue, Jun 14, 2011 at 6:19 PM, Kevin Grittner <Kevin.Grittner@xxxxxxxxxxxx> wrote:
> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
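> # (strips the XML tags, splits on non-alphanumerics, and keeps
> # tokens of two or more characters, one per line)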
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000)) );
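> -- pg_read_file only reads files under the server's data
> -- directory, hence the mv into /var/lib/postgresql/9.0/main above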
>
> and I got this:
> #ERROR: string is too long for tsvector (30990860 bytes, max
> 1048575 bytes)

Your test (whatever data it is that you used) doesn't seem typical
of English text. The entire PostgreSQL documentation in HTML form,
when all the html files are concatenated, is 11424165 bytes (11MB),
and the tsvector of that is 364410 bytes (356KB). I don't suppose
you know of some publicly available file on the web that I could
use to reproduce your problem?

> The year is 2011; I don't think searching a 2MB text file is too
> much to expect.

Based on the ratio for the PostgreSQL docs, it seems possible to
index documents considerably larger than that. Without the markup
(as in the case of a PDF), I bet it would take a lot less than what
I saw for the docs. A printed or typewritten page usually holds
about 2KB of text. I used pdftotext to extract the text of a
119-page technical book about database technology, and it came to
235KB. I made a tsvector for that, and it was 99KB. So,
at *that* rate you'd need about 100 books that size, totaling
11,900 pages of text in a document to hit the limit you showed.
Well, probably more than that, because some of the words might be
repeated from one book to another.
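
To put rough numbers on the text-to-tsvector ratios above
(back-of-the-envelope only):

-- ratio of input text size to tsvector size for the two samples
SELECT round(11424165.0 / 364410, 1) AS html_docs_ratio,  -- ~31.3
       round(235.0 / 99, 1)          AS plain_book_ratio; -- ~2.4

In other words, markup-free prose eats into the 1048575-byte limit
roughly an order of magnitude faster per byte than the HTML docs
did.
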
So, I'm back to wondering what problem you're trying to solve where
this is actually a limitation for you.
-Kevin