Re: vacuum analyze slows sql query

John Meinel <john@xxxxxxxxxxxxxx> · Mon, 08 Nov 2004 18:01:30 -0600

patrick ~ wrote:
Hi John,

Thanks for your reply and analysis.

No problem. It just happens that this is a problem we ran into recently.

--- John Meinel <john@xxxxxxxxxxxxxx> wrote:

patrick ~ wrote:
[...]

Hmm... The fact is I am selecting (in this example anyway) over all
values in pkk_offer table and calling the stored function with each
pkk_offer.offer_id which in turn does a select on pkk_purchase table.
Note that offer_id is a foreign key in pkk_purchase referencing
pkk_offer table.

I don't know if it matters (I suspect that it does) but I am using
LIMIT 1 in the sub-query/stored function.  All I need is one single
row meeting any of the criteria laid out in the stored procedure to
establish an offer_id is "pending".

If you are trying to establish existence, we also had a whole thread on 
this. Basically what we found was that adding an ORDER BY clause, helped 
tremendously in getting the planner to switch to an Index scan. You 
might try something like:

SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1;

There seems to be a big difference between the above statement and:

SELECT column FROM mytable WHERE column='myval' LIMIT 1;

So what is the solution? The only one I'm aware of is to turn your 
static function into a dynamic one.

So somewhere within the function you build up a SQL query string and 
call EXECUTE str. This forces the query planner to be run every time you 
call the function. This means that if you call it will a "nice" value, 
you will get the fast index scan, and if you call it with a "bad" value, 
it will switch back to seq scan.

The downside is you don't get much of a benefit from using as stored 
procedure, as it has to run the query planner all the time (as though 
you issue the query manually each time.) But it still might be better 
for you in the long run.

Well, running the query without the stored function, basically typing
out the stored function as a sub-query shows me:

pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual
time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual
time=2171.287..2171.289 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual
time=2171.264..2171.266 rows=1 loops=618)
                   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83
rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
                         Filter: ((offer_id = $0) AND
(((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR
(pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

while deleting all statistics on the pkk_% tables I get:

pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and
pg_class.relname like 'pkk_%';
DELETE 11

pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual
time=0.190..799.930 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278
rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual
time=1.266..1.267 rows=1 loops=618)
                   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0 
(cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1
loops=618)
                         Index Cond: (offer_id = $0)
                         Filter: ((((expire_time)::timestamp with time zone >
now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL)
OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see this query (over all values of pkk_offer) with out
any pg_statistics on the pkk_purchase table is extremely fast.

Is this a bug in the PostgreSQL planner that misjudges the best
choice with pg_statistics at hand?

--patrick

In order to understand your query I broke it up and restructured it as 
follows.
You might try to add the ORDER BY line, and see what you get.

 EXPLAIN ANALYZE
 SELECT o0.offer_id,
     ( SELECT  CASE WHEN
         ( SELECT p0.purchase_id FROM  pkk_purchase p0
            WHERE p0.offer_id = o0.offer_id
              AND ( p0.pending = true
                    OR ( p0.cancel_date ISNULL
                         AND ( p0.expire_time > NOW() or p0.expire_time 
ISNULL )
                       )
                  )
            ORDER BY p0.purchase_id --Insert this line
            LIMIT 1
         ) ISNULL THEN false
         ELSE true
         END
     ) FROM pkk_offer o0 ;

I also wonder about some parts of your query. I don't know your business 
logic but you are tacking a lot of the query into the WHERE, and I 
wonder if postgres just thinks it's going to need to analyze all the 
data before it gets a match.

I also don't remember what columns you have indices on. Or whether it is 
common to have cancel_date null, or expire_time > NOW() or expire_time 
null, etc.

So is your function just everything within the CASE statement?

You might try rewriting it as a loop using a cursor, as I believe using 
a cursor again lends itself to index scans (as it is even more likely 
that you will not get all the data.)

Something like (this is untested)

create function is_pending(int) returns bool as '
declare
  p_id alias for $1;
begin

  DECLARE is_pending_cursor CURSOR FOR
	SELECT p0.purchase_id FROM pkk_purchase p0
	 WHERE p0.offer_id = p_id;
  FOR READ ONLY;

  FOR FETCH NEXT is_pending_cursor
      IF row.pending = true or ...
        RETURN true;

  RETURN false;
END;
';

I don't know cursors terribly well, but this might get you going. 
Probably in your case you also have a large portion of the records with 
pending = true, which means that with an index scan it doesn't have to 
hit very many records. Either you have a low record count for a 
particular purchase_id, or you have a lot of pendings. seq scan just 
hurts because it has to sift out all the other id's that you don't care 
about.

But remember, I'm not a guru, just someone who has been hit by the 
inequal distribution problem.

John
=:->
Attachment:
signature.asc

Description: OpenPGP digital signature