I have 10 million rows in a table, with a full text index created on one of the columns. I submit this query:

ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
 count
-------
   646
(1 row)

Time: 107.570 ms

...and those are excellent times. But if I alter the query to read:

ims=# select count(*) from search_rm where to_tsvector('english', textsearch) @@ to_tsquery('english', 'woman & beach & ftx1');
 count
-------
 38343
(1 row)

Time: 640.985 ms

...then, as you see, the query slows down by a factor of about 6, which is not so good!

The problem is that we need to be able to search for "ftx1", since that's a flag we put in our document records to tell us the file type, and we need to be able to retrieve different file types at different times.

Now, I *think* the problem is that 'ftx1' is not being treated as though it were a proper word:

ims=# select * from ts_debug('english','woman ball ftx1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII          | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols            |       | {}             |              |
 numword   | Word, letters and digits | ftx1  | {simple}       | simple      | {ftx1}
(5 rows)

Instead of being an asciiword that uses the english_stem dictionary, 'ftx1' gets regarded as a numword handled by the simple dictionary.
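One idea I've been toying with, though I haven't actually tried it, is copying the built-in 'english' configuration and remapping numword tokens onto english_stem, along these lines (the name 'ims_english' is just a placeholder I made up):

```sql
-- Untested sketch: copy the built-in 'english' configuration and
-- remap numword tokens so they go through english_stem instead of simple.
-- 'ims_english' is a made-up name, not anything we have in place.
CREATE TEXT SEARCH CONFIGURATION ims_english ( COPY = english );

ALTER TEXT SEARCH CONFIGURATION ims_english
    ALTER MAPPING FOR numword WITH english_stem;

-- Queries (and, I assume, the index expression) would then have to name
-- the new configuration:
--   SELECT ... WHERE to_tsvector('ims_english', textsearch)
--                 @@ to_tsquery('ims_english', 'woman & beach & ftx1');
```

But I have no idea whether that would actually change the timings, so corrections welcome if I'm barking up the wrong tree.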
If I simply substitute "ftxa" for "ftx1", it *is* then regarded as an asciiword, and the performance of the original query reverts to being just fine, too:

ims=# select * from ts_debug('english','woman ball ftxa');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ftxa  | {english_stem} | english_stem | {ftxa}

ims=# select count(*) from search_rm where to_tsvector('english', textsearch) @@ to_tsquery('english', 'woman & beach & ftxa');
 count
-------
     0
(1 row)

Time: 88.603 ms

As you can see, that's 88ms for a search with 'ftxa' compared to 600+ms for one with 'ftx1'.

I should mention that we have about 45 different 'nonsense word' flags that we use for all sorts of different purposes, such as telling us which region a document is visible in, whether it is in portrait or landscape mode, and so on. All of these flag-words take the form zzzz1, yyyy2 and so on. So there are a lot of these things causing the problem, not just 'ftx1' specifically.

My question, then, is what can I do to stop this slowdown? Is there some way to add 'ftx1' (and the others) as a word in the english_stem dictionary so that it gets regarded as an asciiword, for example? Or is there something else I can do to address the problem?

I'm fairly new to PostgreSQL's full text search. I've read Chapter 12 of the doco with rapt attention, but I don't see anything that leaps out at me as a fix for this issue. All help gratefully received, therefore, and apologies in advance if this is a bit of a newbie question.

Regards
HJR

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general