On 6/14/11 1:42 PM, Tim wrote:
> So I ran this test:

The novel "Hawaii" at 960 pages is roughly 1 MB. tsvector was intended for documents (web pages, news articles, corporate memos, ...), not for books. What you're asking for is interesting, but you can't complain that an open-source project that was designed for a different purpose doesn't meet your needs.

> So how am I to use the PGSQL FTS as a "full text search" when the above
> example can only handle a "small or partial text search"? Maybe a better
> question is, "So how am I to use PGSQL FTS as a 'massively huge text
> search' when it was designed for nothing bigger than 'huge text search'?"
> Any thoughts or alternatives are most welcome.

I'm curious how tsvector could be useful on a 29 MB document. That's roughly one whole encyclopedia set. A document that size should have a huge vocabulary, and tsvector's index would be saturated.

However, if the vocabulary in this 29 MB document isn't that big, then you might consider creating a smaller "document." You could write a Perl script that scans the document, builds a dictionary of the words it contains, and writes that out as a secondary "vocabulary" file listing each unique word once. Create an auxiliary column in your database to hold this vocabulary for each document, and use tsvector to index that instead. The Perl program would be trivial, and tsvector would be happy.

Craig
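P.S. A rough sketch of the kind of Perl script I have in mind. The file name, the crude word-splitting rule, and the one-word-per-line output format are just placeholders; adjust them to your data:

    #!/usr/bin/perl
    # Sketch: boil a large document down to its unique words, so the much
    # smaller "vocabulary" can go into an auxiliary column and be indexed
    # with to_tsvector() instead of the full 29 MB text.
    use strict;
    use warnings;

    my %seen;
    open my $in, '<', 'document.txt' or die "can't open document.txt: $!";
    while (my $line = <$in>) {
        # crude tokenizer: lowercase, split on anything that isn't a letter
        for my $word (split /[^a-z]+/, lc $line) {
            $seen{$word} = 1 if length $word;
        }
    }
    close $in;

    # one word per line; load this into the auxiliary vocabulary column
    print "$_\n" for sort keys %seen;

Because each word appears only once in the output, the tsvector built from the vocabulary column stays small no matter how long the original document is.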