Cluster vs. non-cluster query planning

I'm running PostgreSQL 8.0.7, and I've got a table of orders with about 100,000 rows. I want to look at just the new orders, currently 104 of them.

EXPLAIN ANALYZE
SELECT
   order_id
   FROM
       orders
   WHERE
order_statuses_id = (SELECT id FROM order_statuses WHERE id_name = 'new');

Seq Scan on orders o  (cost=1.20..11395.51 rows=7029 width=8) (actual time=286.038..287.662 rows=104 loops=1)
  Filter: (order_statuses_id = $0)
  InitPlan
    ->  Seq Scan on order_statuses  (cost=0.00..1.20 rows=1 width=4) (actual time=0.030..0.039 rows=1 loops=1)
          Filter: ((id_name)::text = 'new'::text)
Total runtime: 288.102 ms

The dreaded sequential scan. I've got an index on order_statuses_id and I've VACUUM ANALYZEd the table, but I'm currently clustered on the primary key (order_id).

With enable_seqscan = off, I get:
-------------------------------------------------
Index Scan using orders_status_btree_idx on orders o  (cost=4.64..12457.14 rows=7031 width=8) (actual time=0.164..0.664 rows=104 loops=1)
  Index Cond: (order_statuses_id = $0)
  InitPlan
    ->  Index Scan using order_statuses_id_name_key on order_statuses  (cost=0.00..4.64 rows=1 width=4) (actual time=0.128..0.134 rows=1 loops=1)
          Index Cond: ((id_name)::text = 'new'::text)
Total runtime: 1.108 ms

If I hard-code the 'new' status ID, I get:
-------------------------------------------------
EXPLAIN ANALYZE
SELECT
   order_id
   FROM
       orders
   WHERE
       order_statuses_id = 1;

Index Scan using orders_status_btree_idx on orders o (cost=0.00..4539.65 rows=1319 width=8) (actual time=0.132..1.883 rows=104 loops=1)
  Index Cond: (order_statuses_id = 1)
Total runtime: 2.380 ms

Here is the pg_stats entry for orders.order_statuses_id:
schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+-------------------+-------------+-----------+------------+------------------+--------------------------------------+-------------------------+-------------
public | orders | order_statuses_id | 0.000208333 | 4 | 14 | {8,24,10,25} | {0.385417,0.242083,0.230625,0.07875} | {1,7,7,9,9,9,9,9,23,26} | 0.740117

This is with the statistics target set to 16 on the column, since that's the number of distinct values the column can currently take.
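For reference, here is how I set it (standard per-column syntax; table and column names as in the queries above):

```sql
-- Raise the per-column statistics target, then re-gather stats
-- so ANALYZE samples enough distinct values:
ALTER TABLE orders ALTER COLUMN order_statuses_id SET STATISTICS 16;
ANALYZE orders;
```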

Now, here's the thing: if I cluster on the order_statuses_id index, the original query produces:

Index Scan using orders_status_btree_idx on orders o  (cost=1.20..978.94 rows=8203 width=8) (actual time=0.097..0.598 rows=104 loops=1)
  Index Cond: (order_statuses_id = $0)
  InitPlan
    ->  Seq Scan on order_statuses  (cost=0.00..1.20 rows=1 width=4) (actual time=0.056..0.065 rows=1 loops=1)
          Filter: ((id_name)::text = 'new'::text)
Total runtime: 1.042 ms

Estimated cost went way down. The pg_stats entry becomes:

schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+-------------------+-----------+-----------+------------+------------------+----------------------------------------+---------------------+-------------
public | orders | order_statuses_id | 0 | 4 | 12 | {8,24,10,25} | {0.386458,0.244167,0.238333,0.0720833} | {1,7,7,9,9,9,22,26} | 1

I'm hesitant to cluster on the order_statuses_id index, because there are a lot of other queries using this table, many of which join on order_id. I also feel like I ought to be able to get the planner to do an index scan without hard-coding the order_statuses_id value.
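One workaround I'm considering is resolving the status ID in a separate round trip from the application, so the planner sees a literal and can use the per-value statistics. A sketch, though I'd rather not settle for two queries:

```sql
-- Step 1: look up the status id once (currently it happens to be 1):
SELECT id FROM order_statuses WHERE id_name = 'new';

-- Step 2: run the main query with that literal substituted in,
-- which gets the fast index scan shown above:
SELECT order_id FROM orders WHERE order_statuses_id = 1;
```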

Questions:
* What can I do to reduce the estimated row count on the query?
* Why does clustering drive down the estimated cost for the index scan so much? Does a change in correlation from 0.74 to 1 make that much of a difference?
* Can I convince the query planner to use an index scan without clustering on the order_statuses_id index or setting enable_seqscan = off?

Potential note of interest: This is a very wide, monolithic table - no less than 100 columns, with several check constraints, foreign key constraints, and indexes, including three functional indexes.
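Another option I've sketched but not tested is a partial index covering only the 'new' rows. This assumes the 'new' status ID stays fixed at 1, which may not be safe in the long run:

```sql
-- Hypothetical partial index (untested): covers only rows in the
-- 'new' status, so it stays tiny (~104 entries right now):
CREATE INDEX orders_new_status_idx ON orders (order_id)
    WHERE order_statuses_id = 1;
```

I suspect the planner couldn't use this for the subselect form of the query anyway, since it can't prove at plan time that the InitPlan result matches the index predicate.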

Side question: Sometimes, when I VACUUM ANALYZE the table, the pg_stats entry for order_statuses_id has almost all of the possible values in most_common_vals, instead of just a handful. Example:

schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
------------+-----------+-------------------+-----------+-----------+------------+-----------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------+-------------
public | orders | order_statuses_id | 0 | 4 | 13 | {8,24,10,25,9,7,23,26,1,22,2,5,4} | {0.393125,0.240208,0.226042,0.07875,0.0275,0.0145833,0.0110417,0.00291667,0.00229167,0.001875,0.000625,0.000625,0.000416667} | | 1

This doesn't appear to influence whether the index scan is chosen, but I am curious as to why this happens.

