On Mar 5, 2007, at 8:54 PM, Tom Lane wrote:
Hm, the cost for the upper nestloop is way less than you would expect
given that the HASH IN join is going to have to be repeated 100+
times.
I think this must be due to a very low "join_in_selectivity" estimate
but I'm not sure why you are getting that, especially seeing that the
rowcount estimates aren't far off. Can you show us the pg_stats
rows for symptoms.id and symptom_reports.symptom_id?
Hi Tom, thanks for the response. Here are the pg_stats. I think I
understand what the stats say, but I don't know what to conclude from
them.
plm_stage=# select * from pg_stats where tablename = 'symptoms' and
attname = 'id';
schemaname | tablename | attname | null_frac | avg_width | n_distinct
| most_common_vals | most_common_freqs |
histogram_bounds | correlation
------------+-----------+---------+-----------+-----------
+------------+------------------+-------------------
+-------------------------------------+-------------
public | symptoms | id | 0 | 4 | -1
| | |
{1,11,24,34,46,57,71,85,95,106,117} | 0.451606
plm_stage=# select * from pg_stats where tablename =
'symptom_reports' and attname = 'symptom_id';
schemaname | tablename | attname | null_frac | avg_width |
n_distinct | most_common_vals
|
most_common_freqs |
histogram_bounds | correlation
------------+-----------------+------------+-----------+-----------
+------------+------------------------
+-----------------------------------------------------------------------
---------------+-------------------------------------+-------------
public | symptom_reports | symptom_id | 0 | 4
| 80 | {3,2,4,1,5,8,9,7,10,6} |
{0.094,0.0933333,0.0933333,0.092,0.0913333,0.0903333,0.0866667,0.0843333
,0.084,0.08} | {12,18,24,30,38,44,51,57,91,91,114} | 0.0955925
And Ismo, I followed your suggestion to re-write the SQL more
cleanly, and you are right it was faster, so that is certainly a
solution. Although I am still curious why my original query slowed
down after the vacuum analyze. In any case, here is the explain
analyze from the new query. Compare that to the 3441.452 ms of the
old query after the analyze (and 134.045 ms before the analyze):
plm_stage=# explain analyze SELECT count(distinct s.id) AS count_all
FROM symptoms s ,symptom_reports sr,users u WHERE s.id=sr.symptom_id
and sr.user_id=u.id and u.disease_id in (1);
QUERY
PLAN
------------------------------------------------------------------------
-----------------------------------------------------------------
Aggregate (cost=770.05..770.06 rows=1 width=4) (actual
time=176.749..176.751 rows=1 loops=1)
-> Hash Join (cost=89.43..737.50 rows=13020 width=4) (actual
time=7.762..142.063 rows=13038 loops=1)
Hash Cond: ("outer".symptom_id = "inner".id)
-> Hash Join (cost=86.09..538.86 rows=13020 width=4)
(actual time=7.277..89.293 rows=13038 loops=1)
Hash Cond: ("outer".user_id = "inner".id)
-> Seq Scan on symptom_reports sr
(cost=0.00..257.38 rows=13038 width=8) (actual time=0.003..30.499
rows=13038 loops=1)
-> Hash (cost=82.41..82.41 rows=1471 width=4)
(actual time=7.261..7.261 rows=1471 loops=1)
-> Seq Scan on users u (cost=0.00..82.41
rows=1471 width=4) (actual time=0.006..4.133 rows=1471 loops=1)
Filter: (disease_id = 1)
-> Hash (cost=3.07..3.07 rows=107 width=4) (actual
time=0.469..0.469 rows=107 loops=1)
-> Seq Scan on symptoms s (cost=0.00..3.07 rows=107
width=4) (actual time=0.007..0.247 rows=107 loops=1)
Total runtime: 176.842 ms
(12 rows)