On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Howard Rogers <hjr@xxxxxxxxxx> writes:
>> I have 10 million rows in a table, with a full text index created on
>> one of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.
>
>> which is not so good! The problem is that we need to be able to search
>> for "ftx1", since that's a flag we put in our document records to tell
>> us the file type, and we need to be able to retrieve different file
>> types at different times.
>
> You might want to rethink how you're doing that --- it seems like a file
> type flag ought to be a separate column rather than a word in a text
> field.
>
>                        regards, tom lane

OK, Tom: I did actually account for the difference in row counts before I posted, though I accept I didn't show you that. So here goes:

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch)
ims-#    @@ to_tsquery('english', 'wommmman & batt & ftxa')
ims-# limit 20;
 count
-------
     0
(1 row)
Time: 0.593 ms

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch)
ims-#    @@ to_tsquery('english', 'wommmman & batt & ftx1')
ims-# limit 20;
 count
-------
     0
(1 row)
Time: 489.362 ms

Both queries return zero rows, yet one takes an awful lot longer than the other. The only difference between them is that one searches for 'ftxa' and the other searches for 'ftx1'. My complaint of poor scalability (actually, it was an enquiry about the role of dictionary types!) is therefore valid, I think. As a PostgreSQL newbie, I'm happy to accept that I've done something plonkingly stupid to account for these results, but I'd then like to know what it is I've done wrong! A simple scale-up in the number of hits isn't the problem, though, I don't think.

With this amount of data, and with 45 different document attributes that may or may not be searched for -- some involving names, places and dates, some just yes/no flags -- it is utterly impossible to hold them as separate attribute columns and still search on them with anything like decent performance. We adopted this approach with Oracle Text two years ago precisely because it was the only way to keep web-based searches of 10,000,000 records coming back in under a second. So, no, we're not going to re-think storing the 'attribute data' as part of the searchable keyword field, though I'm more than prepared to alter the precise format of that data if it helps PostgreSQL any.

That said, we have people supplying us with document references in the form DA3-76374YY-001, so alpha-numerics simply have to be searchable with good speed, and I can't always magic away the alpha-numeric components even if I wanted to.
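For what it's worth, my (possibly naive) understanding is that ts_debug() will show which parser token type, and hence which dictionary, each of these "words" gets: 'ftx1' presumably comes out as a mixed letters-and-digits token, while 'ftxa' is a plain ASCII word. Something like this sketch -- which I haven't yet run against our real data -- ought to confirm it:

-- Diagnostic sketch: how does the 'english' configuration tokenise
-- ordinary words, our ftx* flags, and a document reference?
-- (Untested here; column names are as per the ts_debug() docs.)
select alias, token, dictionaries, lexemes
from ts_debug('english', 'woman beach ftx1 ftxa DA3-76374YY-001');

If 'ftx1' and 'ftxa' really do land with different token types or dictionaries, that would at least tell us where to look.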
So: I would still like to know whether this performance difference when encountering alpha-numeric "words" is dictionary-related, and if so, what I can do to fix it, please.

Cheers,
HJR
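P.S. In case it helps the diagnosis, the next thing I plan to run is a straight EXPLAIN ANALYZE over both variants, along these lines (just a sketch of the obvious comparison, nothing clever):

explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
   @@ to_tsquery('english', 'wommmman & batt & ftxa');

explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
   @@ to_tsquery('english', 'wommmman & batt & ftx1');

If both variants use the full text index but the 'ftx1' one has to chew through vastly more of it, that would point at the sheer frequency of 'ftx1' in the index rather than at the dictionaries.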