Hi all!
While evaluting the pgsql query planer I found some weird behavior of
the query planer. I think it's plan is way too complex and could much
faster?
CREATE TABLE friends (
id integer NOT NULL,
user_id integer NOT NULL,
ref_id integer NOT NULL,
);
ALTER TABLE ONLY friends ADD CONSTRAINT friends_pkey PRIMARY KEY (id);
CREATE INDEX user_ref ON friends USING btree (user_id, ref_id);
I fill this table with around 2.800.000 random rows (values between 1
and 500.000 for user_id, ref_id).
The intention of the query is to find rows with no "partner" row. The
offset and limit are just to ignore the time needed to send the result
to the client.
SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id) OFFSET 1000000
LIMIT 1
Mysql uses this query plan:
1 PRIMARY f1 index NULL user_ref 8 NULL
2818860 Using where; Using index
2 DEPENDENT SUBQUERY f2 ref user_ref user_ref 8
f1.ref_id,f1.user_id 1 Using index
Time: 9.8s
Postgre uses this query plan:
"Limit (cost=66681.50..66681.50 rows=1 width=139) (actual
time=7413.489..7413.489 rows=1 loops=1)"
" -> Merge Anti Join (cost=40520.17..66681.50 rows=367793 width=139)
(actual time=3705.078..7344.256 rows=1000001 loops=1)"
" Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id =
f2.user_id))"
" -> Index Scan using user_ref on friends f1
(cost=0.00..26097.86 rows=2818347 width=139) (actual
time=0.093..1222.592 rows=1917360 loops=1)"
" -> Materialize (cost=40520.17..40555.40 rows=2818347 width=8)
(actual time=3704.977..5043.347 rows=1990148 loops=1)"
" -> Sort (cost=40520.17..40527.21 rows=2818347 width=8)
(actual time=3704.970..4710.703 rows=1990148 loops=1)"
" Sort Key: f2.ref_id, f2.user_id"
" Sort Method: external merge Disk: 49576kB"
" -> Seq Scan on friends f2 (cost=0.00..18143.18
rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)"
"Total runtime: 7422.516 ms"
It's already faster, which is great, but I wonder why the query plan is
that complex.
I read in the pqsql docs that using a multicolumn key is almost never
needed and only a waste of cpu/space. So I dropped the multicolumn key
and added to separate keys instead:
CREATE INDEX ref1 ON friends USING btree (ref_id);
CREATE INDEX user1 ON friends USING btree (user_id);
New query plan:
"Limit (cost=70345.04..70345.04 rows=1 width=139) (actual
time=43541.709..43541.709 rows=1 loops=1)"
" -> Merge Anti Join (cost=40520.27..70345.04 rows=367793 width=139)
(actual time=3356.694..43467.818 rows=1000001 loops=1)"
" Merge Cond: (f1.user_id = f2.ref_id)"
" Join Filter: (f1.ref_id = f2.user_id)"
" -> Index Scan using user1 on friends f1 (cost=0.00..26059.79
rows=2818347 width=139) (actual time=0.031..1246.668 rows=1917365 loops=1)"
" -> Materialize (cost=40520.17..40555.40 rows=2818347 width=8)
(actual time=3356.615..14941.405 rows=130503729 loops=1)"
" -> Sort (cost=40520.17..40527.21 rows=2818347 width=8)
(actual time=3356.611..4127.435 rows=1990160 loops=1)"
" Sort Key: f2.ref_id"
" Sort Method: external merge Disk: 49560kB"
" -> Seq Scan on friends f2 (cost=0.00..18143.18
rows=2818347 width=8) (actual time=0.012..496.174 rows=2818347 loops=1)"
"Total runtime: 43550.187 ms"
As one can see it's much much slower and only uses one key, not both. I
thought performance should be almost equal.
I also wonder why it makes a difference when adding a "LIMIT" clause to
the subselect in an EXISTS subselect. Shouldn't pgsql always stop after
finding the a row? In mysql is makes no difference in speed, pgsql even
get's slower when adding a LIMIT to the EXISTS subselect (I hoped it
would get faster?!).
SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id LIMIT 1) OFFSET
1000000 LIMIT 1
"Limit (cost=6389166.19..6389172.58 rows=1 width=139) (actual
time=54540.356..54540.356 rows=1 loops=1)"
" -> Seq Scan on friends f1 (cost=0.00..9003446.87 rows=1409174
width=139) (actual time=0.511..54460.006 rows=1000001 loops=1)"
" Filter: (NOT (SubPlan 1))"
" SubPlan 1"
" -> Limit (cost=2.18..3.19 rows=1 width=0) (actual
time=0.029..0.029 rows=0 loops=1832284)"
" -> Bitmap Heap Scan on friends f2 (cost=2.18..3.19
rows=1 width=0) (actual time=0.028..0.028 rows=0 loops=1832284)"
" Recheck Cond: (($0 = ref_id) AND ($1 = user_id))"
" -> BitmapAnd (cost=2.18..2.18 rows=1 width=0)
(actual time=0.027..0.027 rows=0 loops=1832284)"
" -> Bitmap Index Scan on ref1
(cost=0.00..1.09 rows=75 width=0) (actual time=0.011..0.011 rows=85
loops=1832284)"
" Index Cond: ($0 = ref_id)"
" -> Bitmap Index Scan on user1
(cost=0.00..1.09 rows=87 width=0) (actual time=0.011..0.011 rows=87
loops=1737236)"
" Index Cond: ($1 = user_id)"
"Total runtime: 54540.431 ms"
As in my previous tests, this is only a testing environment: so all data
is in memory, no disk activity involved at all, no swap etc.
Thanks,
Corin
BTW: I'll respond to the answers of my previous post later.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance