Re: Full Text Search dictionary issues

Howard Rogers <hjr@xxxxxxxxxx> · Sat, 17 Jul 2010 06:51:46 +1000

On Sat, Jul 17, 2010 at 3:14 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Howard Rogers <hjr@xxxxxxxxxx> writes:
>> OK, Tom: I did actually account for the number of rows difference
>> before I posted, though I accept I didn't show you that. So here goes:
>> ...
>> Both queries return zero rows. One takes an awful lot longer than the
>> other. The only difference between them is that one searches for
>> 'ftx1' and the other searches for 'ftx0'.
>
> Well, this still doesn't tell us anything about what I think the
> critical point is, namely how many actual matches there are for
> ftx1 versus ftx0.  Could we see counts for *just* those words without
> the other conditions?
>
>> So, I would still like to know if this performance difference when
>> encountering alpha-numeric "words" is dictionary-related,
>
> AFAIK there is no significant difference between treatment of pure alpha
> and mixed alphanumeric "words", at least not once you get past
> to_tsquery.  I'm still expecting this is just a matter of how many index
> entries match.  It's barely possible that you've got a dictionary
> configuration that makes the to_tsquery() function itself a lot slower
> in the alphanumeric case, but that should affect ftx1 and ftx0 equally.
>
>                        regards, tom lane
>
> --
> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>

My apologies for making this harder than it should have been. My
queries were correct, my description of them wasn't. The only
difference was 'ftx1' and 'ftxa', not 'ftx0'.

Anyway, I think I've finally got the point being made by you and
Richard (and sorry for being so slow about it!): ftx1 is a very, very
common flag. I'd guess 99% of records would be tagged with it. And
ftxa doesn't exist at all, of course. So, you're right:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','ball
& beach & salsm1');
 count
-------
  1753
(1 row)

Time: 557.010 ms
ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','ball
& beach & salsm4');
 count
-------
    97
(1 row)

Time: 39.518 ms

Salsm1 is another very common flag, applied to about 80% of documents.
Salsm4 applies to about 160 documents in total. So it's really not the
alpha-numerics causing the issue after all. It's genuinely the number
of records matching each term, as you originally said. (And a note to
Steve: it makes no difference separating out the 'flag factor', I'm
afraid).

It's given me lots to think about. The hard part ought to be finding
the women, or the beaches; the flags should be mere filters applied
after those have been fetched. Clearly, though, that's not how this is
behaving, though I've been fooled into thinking it ought to be because
of the apparent default optimization done in Oracle Text. Some new
technique is called for, I guess!  (If you've got any ideas, I'm all
ears...)

Anyway: at least you've all helped me realise that it's not a
dictionary problem (or a word-type) problem, so thank you all very
much for that; much appreciated.

Regards
HJR

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general