On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Howard Rogers <hjr@xxxxxxxxxx> writes:
>> I have 10 million rows in a table, with a full text index created on
>> one of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.
>
>> which is not so good! The problem is that we need to be able to search
>> for "ftx1", since that's a flag we put in our document records to tell
>> us the file type, and we need to be able to retrieve different file
>> types at different times.
>
> You might want to rethink how you're doing that --- it seems like a file
> type flag ought to be a separate column rather than a word in a text
> field.
>
>                        regards, tom lane

OK, Tom: I did actually account for the difference in row counts before I posted, though I accept I didn't show you that. So here goes:

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch)
ims-#    @@ to_tsquery('english', 'wommmman & batt & ftxa')
ims-# limit 20;
 count
-------
     0
(1 row)
Time: 0.593 ms

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch)
ims-#    @@ to_tsquery('english', 'wommmman & batt & ftx1')
ims-# limit 20;
 count
-------
     0
(1 row)
Time: 489.362 ms

Both queries return zero rows, yet one takes an awful lot longer than the other. The only difference between them is that one searches for 'ftxa' and the other searches for 'ftx1'. My complaint of poor scalability (actually, it was an enquiry about the role of dictionary types!) is therefore valid, I think. As a PostgreSQL newbie, I'm happy to accept that I've done something plonkingly stupid to account for these results, but I'd then like to know what it is I've done wrong! A simple scale-up in the number of hits isn't the problem, though, I don't think.

With this amount of data, and with 45 different document attributes that may or may not be searched for -- some involving names, places and dates, some just yes/no flags -- it is utterly impossible to hold them as separate attribute columns and still search on them with anything like decent performance. We adopted this approach with Oracle Text two years ago precisely because it was the only way to keep web-based searches of 10,000,000 records coming back in under a second. So, no, we're not going to re-think storing the 'attribute data' as part of the searchable keyword field, though I'm more than prepared to alter the precise format of that data if it helps PostgreSQL any.

That said, we have people supplying us with document references in the form DA3-76374YY-001, so alpha-numerics simply have to be searchable with good speed, and I can't always magic away the alpha-numeric components even if I wanted to.
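For what it's worth, my (possibly naive) understanding is that ts_debug() will show which parser token type, and hence which dictionary, each of these "words" gets: 'ftx1' presumably comes out as a mixed letters-and-digits token, while 'ftxa' is a plain ASCII word. Something like this sketch -- which I haven't yet run against our real data -- ought to confirm it:

-- Diagnostic sketch: how does the 'english' configuration tokenise
-- ordinary words, our ftx* flags, and a document reference?
-- (Untested here; column names are as per the ts_debug() docs.)
select alias, token, dictionaries, lexemes
from ts_debug('english', 'woman beach ftx1 ftxa DA3-76374YY-001');

If 'ftx1' and 'ftxa' really do land with different token types or dictionaries, that would at least tell us where to look.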
So: I would still like to know whether this performance difference when encountering alpha-numeric "words" is dictionary-related, and if so, what I can do to fix it, please.

Cheers,
HJR
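P.S. In case it helps the diagnosis, the next thing I plan to run is a straight EXPLAIN ANALYZE over both variants, along these lines (just a sketch of the obvious comparison, nothing clever):

explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
   @@ to_tsquery('english', 'wommmman & batt & ftxa');

explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
   @@ to_tsquery('english', 'wommmman & batt & ftx1');

If both variants use the full text index but the 'ftx1' one has to chew through vastly more of it, that would point at the sheer frequency of 'ftx1' in the index rather than at the dictionaries.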