"David West" <david.west 'at' cusppoint.com> writes: > Yes I inserted values in big batches according to a single value of 'B', so > indeed a sequence scan may have to scan forward many millions of rows before > finding the required value. That may well be why the seqscan is so slow to give your results; that said, it doesn't explain why the indexscane is not preferred. > I have been doing regular analyse commands on my table. I don't think my Like, recently? Can you post the stats? gc=# select * from pg_stats where tablename = 'foo'; You should try to ANALYZE again and see if that makes a difference, to be sure. > table is bloated, I haven't been performing updates. However I'm doing a Maybe you've been DELETE'ing then INSERT'ing some? That creates bloat too. Btw, don't forget to prefer TRUNCATE to remove everything from the table, and ANALYZE after large INSERT's. > vacuum analyse now and I'll see if that makes any difference. A single VACUUM may not report how bloated your table is, if it's been VACUUM'ed some before, but not frequently enough. If you have time for it, and you can afford a full lock on the table, only a VACUUM FULL VERBOSE will tell you the previous bloat (the "table .. truncated to .." line IIRC). > I am using postgres 8.3.1 with a default install on windows - no tweaks to > the configuration at all. With a default install, the free space map settings may well be too small for tracking free space on a table as large as 10M rows. Performing VACUUM VERBOSE on database 'template1' will show you interesting information about the current and ideal FSM settings, at the end of the output. Something like: INFO: free space map contains 37709 pages in 276 relations DETAIL: A total of 42080 page slots are in use (including overhead). 42080 page slots are required to track all free space. Current limits are: 204800 page slots, 1000 relations, using 1265 kB. 
Of course, this also depends on the frequency of your VACUUMing (if
autovacuuming is not configured or badly configured) against the
frequency of your UPDATEs and DELETEs.

> There are many other columns in my table, but none of them are used in this
> query.

Can you show us the table definition? I am too ignorant in PG to know
if that would make a difference, but it might ring a bell for others.
AFAIK, more column data may mean larger resultsets and may change the
plan (though "width=128" in the log of your explanation wouldn't mean
a lot of data is stored per row).

> Guillaume in your example you didn't add the limit clause? Postgres chooses
> the correct index in my case without the limit clause, the problem is with
> the limit clause.

Duh, forgot about that, sorry! But I did try it and it was the same:

gc=# explain select * from foo where baz is null and bar = '8' limit 15;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.42 rows=15 width=154)
   ->  Index Scan using foobar on foo  (cost=0.00..30398.66 rows=1079089 width=154)
         Index Cond: (bar = 8)
         Filter: (baz IS NULL)
(4 rows)

> One other difference with your example is both my columns are
> varchar columns, not integer and text, I don't know if that
> would make a difference.

It is always useful to know as much as possible about the actual table
definition and data when isolating a performance problem... I know it
may clash with privacy :/ but that kind of information probably will
not, will it?
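Since the physical clustering of your 'B' column is the suspected
culprit, the per-column statistics are worth a look too. A sketch,
assuming the column names from my test table:

```sql
-- correlation near +1 or -1 means the column's values are physically
-- clustered on disk (exactly what batched inserts produce);
-- n_distinct and null_frac drive the planner's row estimates.
select attname, n_distinct, null_frac, correlation
from pg_stats
where tablename = 'foo' and attname in ('bar', 'baz');
```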
With:

gc=# create table foo ( bar varchar(64), baz varchar(256) );

it doesn't make a difference yet:

gc=# explain select * from foo where baz is null and bar = '8';
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Index Scan using foobar on foo  (cost=0.00..27450.05 rows=982092 width=149)
   Index Cond: ((bar)::text = '8'::text)
   Filter: (baz IS NULL)
(3 rows)

gc=# explain select * from foo where baz is null and bar = '8' limit 15;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Limit  (cost=0.00..0.42 rows=15 width=149)
   ->  Index Scan using foobar on foo  (cost=0.00..27450.05 rows=982092 width=149)
         Index Cond: ((bar)::text = '8'::text)
         Filter: (baz IS NULL)
(4 rows)

Btw, it would help if you could reproduce my test scenario and see
whether PG "correctly" uses the index scan. It is better to try on your
own installation, to take care of any configuration/whatever variation
which may create your problem.

> From looking at the plans, it seems to me postgres is assuming it will only
> have to sequentially scan 15 rows, which is not true in my case because
> column B is not distributed randomly (nor will it be in production). Would

Why do you say that? The plans seem to rather tell that it (correctly)
estimates that the seqscan would bring up about 1M rows for the selected
values of A and B, and that it will then limit the result to 15 rows.

> postgres not be best to ignore the limit when deciding the best index to use
> - in this simple query wouldn't the best plan to use always be the same
> with or without a limit?

I am not too sure, but I'd say no: when PG considers the LIMIT, it knows
that (potentially) fewer rows are to be actually used from the inner
resultset, so a different plan may be devised.

-- 
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36
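You can actually see how the LIMIT changes the picture in the cost
numbers themselves: the Limit node is charged a pro-rated share of the
inner node's total cost, i.e. roughly total_cost * (limit_rows /
estimated_rows). A quick check against the figures from my plan above
(the figures are the assumption here, the formula is just linear
interpolation):

```sql
-- 30398.66 is the index scan's estimated total cost, 1079089 its
-- estimated row count; fetching only 15 rows is costed at the
-- corresponding fraction:
select round((30398.66 / 1079089 * 15)::numeric, 2);
-- gives 0.42, matching the "cost=0.00..0.42" on the Limit node
```

So the planner compares these pro-rated costs, and a plan with a high
total cost but a cheap start (like an index scan in your case, or a
seqscan when it expects matches early) can win under a LIMIT.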