Re: limit clause breaks query planner?

"David West" <david.west@xxxxxxxxxxxxx> · Tue, 2 Sep 2008 15:09:00 +0100

Thanks very much for your help Guillaume, I appreciate you spending time on
this.

> Well, if your have 95% of NULL actorid_ and 10% for each value of
> pooledactor_, then it makes sense to assume it will have to fetch
> about 150 rows to find the 15 awaited ones...

This is only true if the data is randomly distributed, which it isn't
unfortunately.

To any postgres developers reading this, two words: planning hints :-).  In
the face of imperfect information it's not possible to write a perfect
planner, please give us the ability to use the heuristic information we have
as developers, and the planner will never know about.  Even if we force
postgres to use queries that give sub-optimal performance some of the time,
once we can write sufficiently performant queries, we're happy.  In cases
like this where postgres gets it very wrong (and this is just a very simple
query after all), well, we're screwed.

I'm going to try partitioning my database along the pooledactor_ column to
see if I can get reasonable performance for my purposes, even if I can't
reach 10 million rows.

Thanks
David

-----Original Message-----
From: Guillaume Cottenceau [mailto:gc@xxxxxx] 
Sent: 02 September 2008 14:56
To: David West
Cc: pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: [PERFORM] limit clause breaks query planner?

"David West" <david.west 'at' cusppoint.com> writes:

> INFO:  "jbpm_taskinstance": moved 1374243 row versions, truncated 166156
to
> 140279 pages

nothing which would explain so much planning off :/

> Yep, the table is from the jboss jbpm (business process management)
schema.

I've went to that kind of test then, but it didn't help much:

  create table foo ( bar character varying(255), baz character varying(255),
    id_ bigint NOT NULL,
    class_ character(1) NOT NULL,
    version_ integer NOT NULL,
    name_ character varying(255),
    description_ character varying(4000),
    create_ timestamp without time zone,
    start_ timestamp without time zone,
    end_ timestamp without time zone,
    duedate_ timestamp without time zone,
    priority_ integer,
    iscancelled_ boolean,
    issuspended_ boolean,
    isopen_ boolean,
    issignalling_ boolean,
    isblocking_ boolean,
    task_ bigint,
    token_ bigint,
    procinst_ bigint,
    swimlaninstance_ bigint,
    taskmgmtinstance_ bigint,
    processname_ character varying(255) );

  insert into foo ( select generate_series(0, 10000000) / 1000000, case when
random() < 0.05 then 'Today Alcatel-Lucent has announced that Philippe Camus
is appointed non-executive Chairman and Ben Verwaayen is appointed Chief
Executive Officer.' else null end, 1, 'a', 1  );

  create index foobaz on foo(baz);
  create index foobar on foo(bar);
  analyze foo;

Estimated costs still look correct on my side:

  gc=# explain select * from foo where baz is null and bar in ('8') limit
15;
                                       QUERY PLAN

----------------------------------------------------------------------------
--------
   Limit  (cost=0.00..0.46 rows=15 width=1795)
     ->  Index Scan using foobar on foo  (cost=0.00..26311.70 rows=860238
width=1795)
           Index Cond: ((bar)::text = '8'::text)
           Filter: (baz IS NULL)
  (4 rows)

  gc=# set enable_indexscan = off;
  SET
  gc=# explain select * from foo where baz is null and bar in ('8') limit
15;
                                QUERY PLAN                              
  ----------------------------------------------------------------------
   Limit  (cost=0.00..3.46 rows=15 width=1795)
     ->  Seq Scan on foo  (cost=0.00..198396.62 rows=860238 width=1795)
           Filter: ((baz IS NULL) AND ((bar)::text = '8'::text))
  (3 rows)

>>Btw, it would help if you could reproduce my test scenario and
>>see if PG uses "correctly" the indexscan. It is better to try on
>>your installation, to take care of any configuration/whatever
>>variation which may create your problem.
>
> I have tried your example and I get the same results as you.
>
> db=# explain select * from foo where baz is null and bar = '8' limit 15;
>
>                                     QUERY PLAN
>
>
----------------------------------------------------------------------------
> ----
> ---
>  Limit  (cost=0.00..0.53 rows=15 width=154)
>    ->  Index Scan using foobar on foo  (cost=0.00..33159.59 rows=934389
> width=15
> 4)
>          Index Cond: (bar = 8)
>          Filter: (baz IS NULL)
> (4 rows)
>
> db=# drop index foobar;
> DROP INDEX
> db=# explain select * from foo where baz is null and bar = '8' limit 15;
>
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  Limit  (cost=0.00..2.87 rows=15 width=154)
>    ->  Seq Scan on foo  (cost=0.00..178593.35 rows=934389 width=154)
>          Filter: ((baz IS NULL) AND (bar = 8))
> (3 rows)
>
> It's choosing the index because of a cost of 0.53 vs a cost of 2.87 for
> sequential scan.  I wonder why in my real tables the index scan cost is
> higher than the sequential scan cost.  Perhaps because of the extra width
of
> my rows?

You may try to crosscheck with the new test I've put upper, but
I'm skeptical :/

I think I've unfortunately more than reached my level of
incompetence on that subject, sorry I wasn't able to better
locate your problem :/

>>> From looking at the plans, it seems to be postgres is assuming it will 
>>> only
>>> have to sequentially scan 15 rows, which is not true in my case 
>>> because column B is not distributed randomly (nor will it be in 
>>> production).  Would
>>
>>Why do you say that? The explanation seems to rather tell that it
>>(correctly) assumes that the seqscan would bring up about 1M rows for the
> selected values of A and B, and then it will limit to 15 rows.
>
> I say that because the plan gives a really really low number (3.21) for
the
> estimated cost after the limit on sequential scan:
>
> Select * from JBPM_TASKINSTANCE this_ where actorid_ is null and
> this_.POOLEDACTOR_ in ('21') limit 15
> "Limit  (cost=0.00..3.21 rows=15 width=128) (actual
> time=84133.211..84187.247 rows=15 loops=1)"
> "  ->  Seq Scan on jbpm_taskinstance this_  (cost=0.00..234725.85
> rows=1095365 width=128) (actual time=84133.205..84187.186 rows=15
loops=1)"
> "        Filter: ((actorid_ IS NULL) AND ((pooledactor_)::text =
> '21'::text))"
> "Total runtime: 84187.335 ms"
>
> It just seems to me it is not taking into account at all that it might
have
> to scan thousands or millions of rows before it gets the 15 rows it needs.

Well, if your have 95% of NULL actorid_ and 10% for each value of
pooledactor_, then it makes sense to assume it will have to fetch
about 150 rows to find the 15 awaited ones...

In the end, if PG doesn't know about data distribution, its
behavior makes total sense to me: 150 rows of width=128 bytes
need only 3 disk pages, so it shouldn't be faster than with a
seqscan, theoretically; however, I am not sure then why on my
simple "foo" test it isn't using the same decision..

Btw, that should not solve your problem, but normally, to help PG
choose indexscan often enough, it's good to reduce
random_page_cost which is 4 by default (a high value for nowadays
servers), increase effective_cache_size to what's available on
your machine, and potentially the shared_buffers which normally
helps for a good deal of matters, performance-wise.

-- 
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36