Re: Planner cost adjustments

Daniel Begin <jfd553@xxxxxxxxxxx> · Tue, 2 Jun 2015 14:01:35 -0400

Here is a follow-up on the step-by-step procedure proposed by PT

#1 - setup postgresql planner's cost estimate settings for my hardware. 
----------------------------------------------------------------------------
----------
Current parameters values described in section 18.7.2 haven't been changed
except for the effective_cache_size
seq_page_cost = 1
random_page_cost = 4
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025 
effective_cache_size = 10GB

I did a bunch of tests on frequently used queries to see how well they
perform - using SET enable_seqscan = ON/OFF. 
As described earlier in this tread, the planner use Seq Scan on tables even
if using an Index Scan is in this case 5 times faster!
Here are the logs of EXPLAIN ANALYSE on a query...

osmdump=# SET enable_seqscan = ON;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);
----------------------------------------------------------------------------
-----------------------------------------------------
 Hash Semi Join  (cost=21.50..819505.27 rows=726722 width=24) (actual
time=1574.914..7444.938 rows=338568 loops=1)
   Hash Cond: (changesets.user_id = users.id)
   ->  Seq Scan on changesets  (cost=0.00..745407.22 rows=25139722 width=24)
(actual time=0.002..4724.578 rows=25133929 loops=1)
   ->  Hash  (cost=14.00..14.00 rows=600 width=8) (actual time=0.165..0.165
rows=600 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on users  (cost=0.00..14.00 rows=600 width=8) (actual
time=0.003..0.073 rows=600 loops=1)
 Total runtime: 7658.715 ms
(7 rows)

osmdump=# SET enable_seqscan = OFF;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);
----------------------------------------------------------------------------
--------------------------------------------------------------------
 Nested Loop  (cost=10000000015.94..10001072613.45 rows=726722 width=24)
(actual time=0.268..1490.515 rows=338568 loops=1)
   ->  HashAggregate  (cost=10000000015.50..10000000021.50 rows=600 width=8)
(actual time=0.207..0.531 rows=600 loops=1)
         ->  Seq Scan on users  (cost=10000000000.00..10000000014.00
rows=600 width=8) (actual time=0.003..0.037 rows=600 loops=1)
   ->  Index Scan using changesets_useridndx on changesets
(cost=0.44..1775.54 rows=1211 width=24) (actual time=0.038..2.357 rows=564
loops=600
         Index Cond: (user_id = users.id)
 Total runtime: 1715.517 ms
(6 rows)

#2 - Run ANALYZE DATABASE and look at performance/planning improvement.
----------------------------------------------------------------------------
----------
I ran ANALYZE DATABASE then rerun the query. It did not produce any
significant improvement according to the EXPLAIN ANALYSE below...

osmdump=# SET enable_seqscan = ON;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);
----------------------------------------------------------------------------
-----------------------------------------------------
 Hash Semi Join  (cost=21.50..819511.42 rows=729133 width=24) (actual
time=1538.100..7307.743 rows=338568 loops=1)
   Hash Cond: (changesets.user_id = users.id)
   ->  Seq Scan on changesets  (cost=0.00..745390.84 rows=25138084 width=24)
(actual time=0.027..4620.691 rows=25133929 loops=1)
   ->  Hash  (cost=14.00..14.00 rows=600 width=8) (actual time=0.300..0.300
rows=600 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 24kB
         ->  Seq Scan on users  (cost=0.00..14.00 rows=600 width=8) (actual
time=0.022..0.187 rows=600 loops=1)
 Total runtime: 7519.254 ms
(7 rows)

osmdump=# SET enable_seqscan = OFF;
osmdump=# EXPLAIN ANALYSE SELECT user_id, id AS changeset_id,closed FROM
changesets WHERE changesets.user_id IN(SELECT id FROM sample.users);
----------------------------------------------------------------------------
--------------------------------------------------------------------
 Nested Loop  (cost=10000000015.94..10001090810.49 rows=729133 width=24)
(actual time=0.268..1466.248 rows=338568 loops=1)
   ->  HashAggregate  (cost=10000000015.50..10000000021.50 rows=600 width=8)
(actual time=0.205..0.530 rows=600 loops=1)
         ->  Seq Scan on users  (cost=10000000000.00..10000000014.00
rows=600 width=8) (actual time=0.003..0.035 rows=600 loops=1)
   ->  Index Scan using changesets_useridndx on changesets
(cost=0.44..1805.83 rows=1215 width=24) (actual time=0.036..2.314 rows=564
loops=600)
         Index Cond: (user_id = users.id)
 Total runtime: 1677.447 ms
(6 rows)

#3 - Run EXPLAIN ANALYZE and look for discrepancies between the estimated
and actual times
----------------------------------------------------------------------------
----------
Looking at above results, there are obvious discrepancies between
expected/actual rows and time!
I dug a bit by exploring/trying to understand the different concepts
explained in...

http://www.postgresql.org/docs/9.4/static/planner-stats.html
http://www.postgresql.org/docs/8.1/static/planner-stats-details.html
http://www.postgresql.org/docs/9.2/static/view-pg-stats.html

Concerning discrepancies between the actual number of rows and predicted
value, I looked at what pg_stats was saying about user_id in table
changesets.
Here are the values provided to the planner...
Average_width=8
histogram_bounds: the size of the bins varies between 50 and 150000, which
make sense because if I had divided the column's values into groups of
approximately equal population, I would have produced bins between 1 and
100000 (if sorted by frequency)
n_distinct= 20686 (there is actually 464858 distinct values for user_id in
the table)
most_common_vals: values make sense (I checked the frequency count of a
couple most common users_id)
correlation=0.617782 (?)
most_common_elems, most_common_elem_freqs and elem_count_histogram were
empty

At this point, I wonder if the assumptions behind the planner's statistics
may produce such problems since the distribution of my data is not uniform
but follows a power law (some user_id would return millions of records while
others only one).  
This is the farthest I can go at this point. Maybe someone can provide me
with more explanations regarding planner's behavior and ways to go further
to make it work properly?

Best regards,
Daniel

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general