My limited testing indicates that tsvector size has an approximately linear correlation with each of two variables: the number of unique words and the total word count.
Presumably your "nonlinear" remark was (correctly) directed at the correlation between file size and tsvector size.
I found that:
VS = UW*AWS + 2*WC + 550, where PWC < 160
approximates the size of a tsvector. In other words:
tsvector size (max 1048575 bytes) = (constant #1 = 1)*(number of unique words)*(average word size) + (constant #2 = 2)*(word count) + (constant #3 = 550)
where PWC, the per-word count (how often a given word repeats), stays below about 160 (which seems related to the "No more than 256 positions per lexeme" restriction), and the average word size can be estimated.
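That restriction is easy to sanity-check directly (my addition here, not part of the tests below): once a lexeme's position list is full, further repetitions should stop adding to the size:
select pg_column_size(to_tsvector(repeat('word ', 256))) as at_cap,
       pg_column_size(to_tsvector(repeat('word ', 300))) as past_cap;
-- at_cap and past_cap should come out equal, since positions past the
-- 256th are simply discarded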
I'm sure there are some things the calculation is not accounting for (word variation, maybe), but it seemed to work decently in my limited test. I would not be surprised if it lost an order of magnitude in accuracy when applied to a larger data set without refinement.
So a tsvector might hold about 147,609 unique words, or 69,405 with an average repeat of 10.
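The first figure is just the formula solved at the 1048575-byte maximum, assuming every word appears exactly once (so WC = UW) and the 5.1-byte average word length from the OED figure below:
echo "(1048575-550)/(5.1+2)" | bc
147609 # maximum unique words, each appearing once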
In practice this limitation is unlikely to be important, though it is still likely to be hit a few times among the millions of people who use PostgreSQL (knowingly or otherwise).
The Oxford English Dictionary claims there are about 228,132 unique words, with an average word length of 5.1 (about 2.4 MB). The test file I drew data from had 2,972,885 words (27 MB, average word length of 9).
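Plugging the OED numbers into the formula (again assuming each word appears once, so WC = UW) lands well over the 1048575-byte maximum, which is why the full word list would not fit:
echo "228132*5.1 + 2*228132 + 550" | bc
1620287.2 # estimated tsvector size for the OED word list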
Some of my testing:
echo "(1*100)*(6.560)+(2*100)+550"|bc
1406.000 # calculated
1406 # expected
echo "(1*100)*(6.560)+(2*1000)+550"|bc
3206.000 # calculated
3202 # expected
echo "(1*100)*(6.560)+(2*5000)+550"|bc
11206.000 # calculated
11202 # expected
echo "(1*100)*(6.560)+(2*10000)+550"|bc
21206.000 # calculated
21202 # expected
echo "(1*100)*(6.560)+(2*100)+550"|bc
1406.000 # calculated
1406 # expected
echo "(1*200)*(6.575)+(2*200)+550"|bc
2265.000 # calculated
2726 # expected
echo "(1*500)*(7.572)+(2*500)+550"|bc
5336.000 # calculated
7378 # expected
echo "(1*1000)*(7.792)+(2*1000)+550"|bc
10342.000 # calculated
10736 # expected
echo "(1*1500)*(8.302)+(2*1500)+550"|bc
16003.000 # calculated
15738 # expected
File sizes:
ls -hal text.*
-rwxrwxrwx 1 postgres postgres 6.5K 2011-06-15 19:42 text.100x10.txt
-rwxrwxrwx 1 postgres postgres 33K 2011-06-15 19:49 text.100x50.txt
-rwxrwxrwx 1 postgres postgres 65K 2011-06-15 19:41 text.100x100.txt
-rwxrwxrwx 1 postgres postgres 97K 2011-06-15 20:05 text.100x150.txt
-rwxrwxrwx 1 postgres postgres 656 2011-06-15 18:01 text.100.txt
-rwxrwxrwx 1 postgres postgres 1.3K 2011-06-15 20:51 text.200.txt
-rwxrwxrwx 1 postgres postgres 3.7K 2011-06-15 20:52 text.500.txt
-rwxrwxrwx 1 postgres postgres 7.7K 2011-06-15 20:52 text.1000.txt
-rwxrwxrwx 1 postgres postgres 13K 2011-06-15 20:52 text.1500.txt
Average word lengths (the script is sketched after this output):
bash average_word_length.sh
text.100.txt 6.560
text.200.txt 6.575
text.500.txt 7.572
text.1000.txt 7.792
text.1500.txt 8.302
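For reference, a script along these lines produces the output above (a sketch, not necessarily the exact script used):
#!/bin/sh
# sketch of average_word_length.sh: print each file's name and its mean
# word length to three decimal places, matching the output format above
# usage: sh average_word_length.sh text.100.txt text.200.txt ...
for f in "$@"; do
    awk '{ for (i = 1; i <= NF; i++) { chars += length($i); words++ } }
         END { printf "%s %.3f\n", FILENAME, chars / words }' "$f"
done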
Tsvector sizes (how the table was populated is sketched after the results):
select title,pg_column_size(words) from test order by pg_column_size;
title | pg_column_size
-------------------+----------------
text.100.txt | 1406
text.100x10.txt | 3202
text.100x50.txt | 11202
text.100x100.txt | 21202
text.100x150.txt | 31112
text.100.txt | 1406
text.200.txt | 2726
text.500.txt | 7378
text.1000.txt | 10736
text.1500.txt | 15738
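The test table was populated along these lines (a sketch, not the exact commands; it assumes the text.* files sit under the data directory so pg_read_file() can read them, which requires superuser):
create table test (title text, words tsvector);
insert into test
  select f, to_tsvector(pg_read_file(f))
    from unnest(array['text.100.txt', 'text.200.txt', 'text.500.txt']) as f;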
On Wed, Jun 15, 2011 at 2:31 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
"Mark Johnson" <mark@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> writes:... which it won't. There is no real-world full text indexing
> When this discussion first started, I immediately thought about people
> who full text index their server's log files. As a test I copied
> /var/log/messages to $PGDATA and then used the same pg_read_file()
> function you mentioned earlier to pull the data into a column of type
> text. The original file was 4.3 MB, and the db column had length
> 4334920 and the function pg_column_size reported a size of 1058747. I
> then added a column named tsv of type tsvector, and populated it using
> to_tsvector(). The function pg_column_size reported 201557. So in this
> test a 4.2 MB text file produced a tsvector of size 200 KB. If this
> scales linearly,
application where there aren't many duplications of words. (The OP
eventually admitted that his "test case" was a dictionary word list
and not an actual document.) Any discussion of required tsvector
sizes that doesn't account for the actual, nonlinear scaling behavior
isn't worth the electrons it's printed on.
regards, tom lane