I'm using 8.3 and I have a table that contains many revisions of the same entity and I have a query that is super slow. Please help! I'm going to paste in some SQL to set up test cases and some plans below. If that isn't the right way to post to this list please let me know and I'll revise.
My table looks kind of like this but wider:
CREATE TEMPORARY TABLE test (revision SERIAL NOT NULL PRIMARY KEY, a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO test (a, b, c) SELECT a, 1, 25 FROM generate_series(1, 100000) AS t1(a), generate_series(1, 10) as t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;
I need to SELECT all the columns with the latest revision for a subset of As. What is the right way to do this quickly?
When I do it like this:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
FROM request
JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
JOIN test USING (a, b);
DROP TABLE request;
The plan for the SELECT is pretty bad:
"Hash Join (cost=32792.50..77907.29 rows=62288 width=20) (actual time=769.570..2222.050 rows=199 loops=1)"
" Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
" -> Hash Join (cost=5.48..38659.23 rows=62288 width=8) (actual time=20.621..830.235 rows=199 loops=1)"
" Hash Cond: (pg_temp_7.test.a = request.a)"
" -> GroupAggregate (cost=0.00..37170.11 rows=62601 width=8) (actual time=16.847..808.475 rows=100000 loops=1)"
" -> Index Scan using test_a on test (cost=0.00..31388.04 rows=999912 width=8) (actual time=16.826..569.035 rows=1000000 loops=1)"
" -> Hash (cost=2.99..2.99 rows=199 width=4) (actual time=3.736..3.736 rows=199 loops=1)"
" -> Seq Scan on request (cost=0.00..2.99 rows=199 width=4) (actual time=3.658..3.689 rows=199 loops=1)"
" -> Hash (cost=15405.12..15405.12 rows=999912 width=16) (actual time=723.673..723.673 rows=1000000 loops=1)"
" -> Seq Scan on test (cost=0.00..15405.12 rows=999912 width=16) (actual time=0.006..290.313 rows=1000000 loops=1)"
"Total runtime: 2222.267 ms"
If I instead issue the query as:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request SET revision = (SELECT MAX(revision) FROM test WHERE request.a = test.a);
ANALYZE request;
SELECT *
FROM request
JOIN test USING (revision)
DROP TABLE request;
The whole thing runs tons faster. The UPDATE uses the right index and is way sub second and the SELECT's plan is fine:
"Merge Join (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953 rows=199 loops=1)"
" Merge Cond: (test.revision = request.revision)"
" -> Index Scan using test_pkey on test (cost=0.00..31388.04 rows=999912 width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
" -> Sort (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133 rows=199 loops=1)"
" Sort Key: request.revision"
" Sort Method: quicksort Memory: 34kB"
" -> Seq Scan on request (cost=0.00..3.99 rows=199 width=8) (actual time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"
Am I missing something or is this really the best way to do this in 8.3?
Thanks for slogging through all this,
Nik Everett