Hi,
Is there any inherent benefit of using a the IN operator versus
joining a temporary table? Should they offer near equal performance?
It appears bitmap scan's aren't done when matching across a small
temporary table.
I have a temporary table with 5 integers in it that I'm matching
against mildly complex view that has 5 joins. I've analyzed the
database after the temporary table was created.
Matching against the temporary table takes: 36492.836 ms.
Matching using the IN operator with the same content takes: 2.732 ms.
These measurements are after the query has been run a few times, so
the data should be in cache.
It would appear that the temporary table's join isn't evaluated deep
enough in the query plan to prevent the more expensive joins from
running, is there a way for force it? Could some setting be wrong
that telling the planner to make this decision? The same thing
happens when I perform the join without the view.
select * from foo;
oid
--------
161007
161008
161000
161009
161002
(5 rows)
Plan for IN match:
=# explain analyze select * from crawled_url_full_view where
crawled_url_full_view.oid in (161007, 161008, 161000, 161009, 161002);
QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-------------------
Hash Left Join (cost=14.50..94.11 rows=5 width=538) (actual
time=1.025..1.522 rows=5 loops=1)
Hash Cond: ("outer".classification_set_id = "inner".id)
Join Filter: ("outer".classification_set_id IS NOT NULL)
-> Hash Left Join (cost=13.30..92.86 rows=5 width=526) (actual
time=0.794..1.251 rows=5 loops=1)
Hash Cond: ("outer".charset_id = "inner".id)
Join Filter: ("outer".charset_id IS NOT NULL)
-> Hash Left Join (cost=12.21..91.70 rows=5 width=515)
(actual time=0.631..1.048 rows=5 loops=1)
Hash Cond: ("outer".http_error_description_id =
"inner".id)
Join Filter: ("outer".http_error_description_id IS
NOT NULL)
-> Hash Left Join (cost=11.13..90.59 rows=5
width=472) (actual time=0.488..0.868 rows=5 loops=1)
Hash Cond: ("outer".content_type_id = "inner".id)
Join Filter: ("outer".content_type_id IS NOT NULL)
-> Nested Loop Left Join (cost=10.02..89.41
rows=5 width=443) (actual time=0.244..0.578 rows=5 loops=1)
Join Filter: ("outer".redirect_url_id IS
NOT NULL)
-> Nested Loop Left Join
(cost=10.02..59.56 rows=5 width=339) (actual time=0.225..0.488 rows=5
loops=1)
-> Bitmap Heap Scan on
crawled_url (cost=10.02..29.71 rows=5 width=235) (actual
time=0.170..0.217 rows=5 loops=1)
Recheck Cond: ((oid = 161007)
OR (oid = 161008) OR (oid = 161000) OR (oid = 161009) OR (oid = 161002))
-> BitmapOr
(cost=10.02..10.02 rows=5 width=0) (actual time=0.137..0.137 rows=0
loops=1)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.061..0.061 rows=1 loops=1)
Index Cond: (oid
= 161007)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.013..0.013 rows=1 loops=1)
Index Cond: (oid
= 161008)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.013..0.013 rows=1 loops=1)
Index Cond: (oid
= 161000)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.014..0.014 rows=1 loops=1)
Index Cond: (oid
= 161009)
-> Bitmap Index Scan
on crawled_url_pkey (cost=0.00..2.00 rows=1 width=0) (actual
time=0.012..0.012 rows=1 loops=1)
Index Cond: (oid
= 161002)
-> Index Scan using url_pkey on
url (cost=0.00..5.96 rows=1 width=108) (actual time=0.031..0.036
rows=1 loops=5)
Index Cond: (url.url_id =
"outer".url_id)
-> Index Scan using url_pkey on url r1
(cost=0.00..5.96 rows=1 width=108) (actual time=0.004..0.004 rows=0
loops=5)
Index Cond: (r1.url_id =
"outer".redirect_url_id)
-> Hash (cost=1.09..1.09 rows=9 width=33)
(actual time=0.130..0.130 rows=9 loops=1)
-> Seq Scan on content_types
(cost=0.00..1.09 rows=9 width=33) (actual time=0.017..0.062 rows=9
loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=47) (actual
time=0.088..0.088 rows=6 loops=1)
-> Seq Scan on http_error_descriptions
(cost=0.00..1.06 rows=6 width=47) (actual time=0.010..0.040 rows=6
loops=1)
-> Hash (cost=1.08..1.08 rows=8 width=15) (actual
time=0.103..0.103 rows=8 loops=1)
-> Seq Scan on charsets (cost=0.00..1.08 rows=8
width=15) (actual time=0.011..0.048 rows=8 loops=1)
-> Hash (cost=1.16..1.16 rows=16 width=16) (actual
time=0.175..0.175 rows=16 loops=1)
-> Seq Scan on classification_sets (cost=0.00..1.16
rows=16 width=16) (actual time=0.012..0.088 rows=16 loops=1)
Total runtime: 2.743 ms
(41 rows)
Plan for temp table match:
=# explain analyze select * from foo, crawled_url_full_view where
crawled_url_full_view.oid = foo.oid;
QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------
Hash IN Join (cost=35667.15..145600.71 rows=5 width=538) (actual
time=22371.445..36482.823 rows=5 loops=1)
Hash Cond: ("outer".oid = "inner".oid)
-> Hash Left Join (cost=35666.09..143698.61 rows=380198
width=538) (actual time=9901.782..35218.758 rows=360531 loops=1)
Hash Cond: ("outer".classification_set_id = "inner".id)
Join Filter: ("outer".classification_set_id IS NOT NULL)
-> Hash Left Join (cost=35664.89..140493.61 rows=380198
width=526) (actual time=9901.456..32363.212 rows=360531 loops=1)
Hash Cond: ("outer".charset_id = "inner".id)
Join Filter: ("outer".charset_id IS NOT NULL)
-> Hash Left Join (cost=35663.79..135684.27
rows=380198 width=515) (actual time=9901.257..29400.189 rows=360531
loops=1)
Hash Cond: ("outer".http_error_description_id =
"inner".id)
Join Filter: ("outer".http_error_description_id
IS NOT NULL)
-> Hash Left Join (cost=35662.71..133782.19
rows=380198 width=472) (actual time=9901.080..26691.473 rows=360531
loops=1)
Hash Cond: ("outer".content_type_id =
"inner".id)
Join Filter: ("outer".content_type_id IS
NOT NULL)
-> Hash Left Join
(cost=35661.60..128972.84 rows=380198 width=443) (actual
time=9900.802..23743.323 rows=360531 loops=1)
Hash Cond: ("outer".redirect_url_id
= "inner".url_id)
Join Filter:
("outer".redirect_url_id IS NOT NULL)
-> Hash Left Join
(cost=17830.80..66680.80 rows=380198 width=339) (actual
time=4592.701..14466.994 rows=360531 loops=1)
Hash Cond: ("outer".url_id =
"inner".url_id)
-> Seq Scan on crawled_url
(cost=0.00..10509.98 rows=380198 width=235) (actual
time=0.026..2976.911 rows=360531 loops=1)
-> Hash
(cost=10627.04..10627.04 rows=377104 width=108) (actual
time=4591.703..4591.703 rows=382149 loops=1)
-> Seq Scan on url
(cost=0.00..10627.04 rows=377104 width=108) (actual
time=0.041..2142.702 rows=382149 loops=1)
-> Hash (cost=10627.04..10627.04
rows=377104 width=108) (actual time=5307.540..5307.540 rows=382149
loops=1)
-> Seq Scan on url r1
(cost=0.00..10627.04 rows=377104 width=108) (actual
time=0.138..2503.577 rows=382149 loops=1)
-> Hash (cost=1.09..1.09 rows=9
width=33) (actual time=0.144..0.144 rows=9 loops=1)
-> Seq Scan on content_types
(cost=0.00..1.09 rows=9 width=33) (actual time=0.020..0.068 rows=9
loops=1)
-> Hash (cost=1.06..1.06 rows=6 width=47)
(actual time=0.108..0.108 rows=6 loops=1)
-> Seq Scan on http_error_descriptions
(cost=0.00..1.06 rows=6 width=47) (actual time=0.015..0.049 rows=6
loops=1)
-> Hash (cost=1.08..1.08 rows=8 width=15) (actual
time=0.129..0.129 rows=8 loops=1)
-> Seq Scan on charsets (cost=0.00..1.08
rows=8 width=15) (actual time=0.014..0.058 rows=8 loops=1)
-> Hash (cost=1.16..1.16 rows=16 width=16) (actual
time=0.234..0.234 rows=16 loops=1)
-> Seq Scan on classification_sets (cost=0.00..1.16
rows=16 width=16) (actual time=0.014..0.107 rows=16 loops=1)
-> Hash (cost=1.05..1.05 rows=5 width=4) (actual
time=0.092..0.092 rows=5 loops=1)
-> Seq Scan on foo (cost=0.00..1.05 rows=5 width=4)
(actual time=0.022..0.044 rows=5 loops=1)
Total runtime: 36492.836 ms
(35 rows)
Definition of the view:
create view crawled_url_full_view as
select crawled_url.*,
url.url,
r1.url as redirect_url,
content_types.type as content_type,
http_error_descriptions.error as http_error_description,
charsets.name as charset,
classification_sets.name as classification_set
from crawled_url left join url on url.url_id = crawled_url.url_id
left join url as r1 on (r1.url_id = crawled_url.redirect_url_id and
crawled_url.redirect_url_id is not null)
left join content_types on (content_types.id =
crawled_url.content_type_id and crawled_url.content_type_id is not null)
left join http_error_descriptions on (http_error_descriptions.id =
crawled_url.http_error_description_id and
crawled_url.http_error_description_id is not null)
left join charsets on (charsets.id = crawled_url.charset_id and
crawled_url.charset_id is not null)
left join classification_sets on (classification_sets.id =
crawled_url.classification_set_id and
crawled_url.classification_set_id is not null);
Version is: PostgreSQL 8.1.4 on i686-pc-linux-gnu, compiled by GCC
gcc (GCC) 4.0.1 20050727 (Red Hat 4.0.1-5)
work_mem=30000
shared_buffers=5000
effective_cache_size=15000
Thanks for any help,
Rusty
--
Rusty Conover
InfoGears Inc.
Web: http://www.infogears.com