Re: looking for a faster way to do that

hamann.w@xxxxxxxxxxx · Fri, 23 Sep 2011 09:45:51 +0200

Alban Hertroys <haramrae@xxxxxxxxx> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>> 
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>  
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>  
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix)

\d items
          Table "pg_temp_1.items"
 Column |         Type          | Modifiers 
--------+-----------------------+-----------
 num    | integer               | 
 code   | character varying(40) | 
create index itemsc on items (code);

select count(*) from items;
 count 
-------
  9614

A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.

Next I copied a file of wanted codes

create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp

the file contains plain names, i.e. unanchored matches

explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
 Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
 Total runtime: 336286.692 ms

An exact match  "where items.code = n.wantcode" on the same data completes in 40 ms

BTW: indexing the second table does not affect the query plan or the runtime, it just shows
actual row count rather than estimate.

This is, of course, bad; an anchored match could be faster and also is more appropriate
to the scenario. So I change the contents of the second table

update n set wantcode = textcat('^', wantcode);

and try again, with similar results
 Nested Loop  (cost=14.15..176478.01 rows=39178 width=36) (actual time=125.114..308831.697 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.061..2034.572 rows=9614 loops=1)
   ->  Materialize  (cost=14.15..22.30 rows=815 width=32) (actual time=0.001..1.095 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.114..1.893 rows=815 loops=1)
 Total runtime: 308837.746 ms

I am aware that this is unlikely to work fast (the planner would perhaps need a hint in the query
rather than in the data column to choose an anchored match algorithm (in case there is
such an algo, of course)

So I wonder whether there might be a different approach to this problem rather than
pattern matching.
I recall I had a similar problem before with a "contacts" column possibly containing one or more
email addresses. Here searches would also be number of people times number of requests
performance. I finally ended up with a @@ match (contrib/tsquery) and a supporting GIST index,
but that only supports exact match, not prefix

Regards
Wolfgang Hamann

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general