Re: vacuum analyze slows sql query

patrick ~ <sidsrr@xxxxxxxxx> · Mon, 8 Nov 2004 15:19:51 -0800 (PST)

Hi John,

Thanks for your reply and analysis.

--- John Meinel <john@xxxxxxxxxxxxxx> wrote:

> patrick ~ wrote:
> [...]
> > pkk=# explain analyze execute pkk_01(    241 );
> >       QUERY PLAN
> >
>
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213
> > rows=1 loops=1)
> >    InitPlan
> >      ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual
> > time=2872.189..2872.189 rows=0 loops=1)
> >            ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983
> > width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
> >                  Filter: ((offer_id = $1) AND (((expire_time)::timestamp
> with
> > time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND
> > ((cancel_date IS NULL) OR (pending = true)))
> >  Total runtime: 2872.339 ms
> > (6 rows)
> > 
> > Time: 2873.479 ms
> > 
> 
> [...]
> 
> > So, is this the ultimate solution to this issue?
> > 
> > --patrick
> 
> It's not so much that correlation is < 0.5. It sounds like you're 
> running into the same issue that I ran into in the past. You have a 
> column with lots of repeated values, and a few exceptional ones. Notice 
> this part of the query:
> ->  Seq Scan on pkk_purchase p0  (cost rows=13983) (actual rows=0)
> 
> For a general number, it thinks it might return 14,000 rows, hence the 
> sequential scan. Before you do ANALYZE, it uses whatever defaults exist, 
> which are probably closer to reality.
> 
> The problem is that you probably have some values for pkk_purchase where 
> it could return 14,000 rows (possibly much much more). And for those, 
> seq scan is the best plan. However, for the particular value that you 
> are testing, there are very few (no) entries in the table.

You are absoultely correct:

pkk=# select offer_id,count(*) from pkk_purchase group by offer_id order by
count ;
 offer_id | count  
----------+--------
     1019 |      1
     1018 |      1
     1016 |      1 (many of these)
      ... |    ...
     2131 |      6
      844 |      6
     1098 |      6 (a dozen or so of these)
      ... |    ...
     2263 |    682
     2145 |    723
     2258 |    797
     2091 |    863
      ... |    ...
     1153 |  96330 (the few heavy weights)
      244 | 122163
      242 | 255719
      243 | 273427
      184 | 348476

> With a prepared statement (or a function) it has to determine ahead of 
> time what the best query is without knowing what value you are going to 
>   ask for.
> 
> Lets say for a second that you manage to trick it into using index scan, 
> and then you actually call the function with one of the values that 
> returns 1,000s of rows. Probably it will take 10-100 times longer than 
> if it used a seq scan.

Hmm... The fact is I am selecting (in this example anyway) over all
values in pkk_offer table and calling the stored function with each
pkk_offer.offer_id which in turn does a select on pkk_purchase table.
Note that offer_id is a foreign key in pkk_purchase referencing
pkk_offer table.

I don't know if it matters (I suspect that it does) but I am using
LIMIT 1 in the sub-query/stored function.  All I need is one single
row meeting any of the criteria laid out in the stored procedure to
establish an offer_id is "pending".

> So what is the solution? The only one I'm aware of is to turn your 
> static function into a dynamic one.
> 
> So somewhere within the function you build up a SQL query string and 
> call EXECUTE str. This forces the query planner to be run every time you 
> call the function. This means that if you call it will a "nice" value, 
> you will get the fast index scan, and if you call it with a "bad" value, 
> it will switch back to seq scan.
> 
> The downside is you don't get much of a benefit from using as stored 
> procedure, as it has to run the query planner all the time (as though 
> you issue the query manually each time.) But it still might be better 
> for you in the long run.

Well, running the query without the stored function, basically typing
out the stored function as a sub-query shows me:

pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual
time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual
time=2171.287..2171.289 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual
time=2171.264..2171.266 rows=1 loops=618)
                   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83
rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
                         Filter: ((offer_id = $0) AND
(((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR
(pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

while deleting all statistics on the pkk_% tables I get:

pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and
pg_class.relname like 'pkk_%';
DELETE 11

pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual
time=0.190..799.930 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278
rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual
time=1.266..1.267 rows=1 loops=618)
                   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0 
(cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1
loops=618)
                         Index Cond: (offer_id = $0)
                         Filter: ((((expire_time)::timestamp with time zone >
now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL)
OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see this query (over all values of pkk_offer) with out
any pg_statistics on the pkk_purchase table is extremely fast.

Is this a bug in the PostgreSQL planner that misjudges the best
choice with pg_statistics at hand?

--patrick

__________________________________ 
Do you Yahoo!? 
Check out the new Yahoo! Front Page. 
www.yahoo.com