Hello Merlin,
So far the improvement in responsiveness has been very noticeable, even
without indexing the temporary tables. Of course, this is just trading
accuracy for speed as I simply narrow arbitrarily the search space ...
The schema I am working on is close to the one I am referencing in this
thread : http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php
Since we want to implement full text search and tags querying, it can
lead to rather complex autogenerated queries such as this one (find log
lines with the word 'root' in it, dated from 11/04 to 11/06, where the
'program' tag is sshd and the 'severity_code' tag is less than 3) :
EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid,
lcond84.date FROM
( SELECT tmp84.logid, tmp84.date FROM logs tmp84
WHERE TRUE AND ( to_tsvector('simple',tmp84.body) @@
plainto_tsquery('simple','root') ) AND tmp84.date > '2010-11-04
10:22:06.26' AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84
NATURAL JOIN
( SELECT tmp85.logid FROM tags tmp85 WHERE
FALSE
OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) )
OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) )
GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84 )
)) AS r ORDER BY r.date DESC LIMIT 1000;
Giving the following query plan :
Limit (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.257..34744.257 rows=0 loops=1)
-> Sort (cost=765445.54..765445.56 rows=9 width=16) (actual
time=34744.255..34744.255 rows=0 loops=1)
Sort Key: tmp84.date
Sort Method: quicksort Memory: 17kB
-> Hash Join (cost=765005.46..765445.40 rows=9 width=16)
(actual time=34744.202..34744.202 rows=0 loops=1)
Hash Cond: (tmp85.logid = tmp84.logid)
-> HashAggregate (cost=758440.29..758669.77 rows=15299
width=8) (actual time=33343.816..33343.816 rows=0 loops=1)
Filter: (count(tmp85.logid) = 2)
-> Bitmap Heap Scan on tags tmp85
(cost=92363.26..757225.45 rows=242968 width=8) (actual
time=20676.354..33294.252 rows=32864 loops=1)
Recheck Cond: ((name =
'severity_code'::text) OR (name = 'program'::text))
Filter: (((name = 'severity_code'::text) AND
num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND
((value).storedvalue = 'sshd'::text)))
-> BitmapOr (cost=92363.26..92363.26
rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1)
-> Bitmap Index Scan on nameval_idx
(cost=0.00..30388.35 rows=707841 width=0) (actual
time=19337.358..19337.358 rows=708719 loops=1)
Index Cond: (name =
'severity_code'::text)
-> Bitmap Index Scan on nameval_idx
(cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551
rows=1484703 loops=1)
Index Cond: (name = 'program'::text)
-> Hash (cost=6553.06..6553.06 rows=969 width=16)
(actual time=1400.378..1400.378 rows=32516 loops=1)
-> Bitmap Heap Scan on logs tmp84
(cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745
rows=32516 loops=1)
Recheck Cond:
(to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
Filter: ((date > '2010-11-04
10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06
10:22:06.26+01'::timestamp with time zone))
-> Bitmap Index Scan on fulltext_body_idx
(cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874
rows=64340 loops=1)
Index Cond:
(to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
Total runtime: 34756.938 ms
This one isn't too bad, but the runtime seems to increase exponentially
with the tables size. Therefore, using a temporary table based on the
date condition can cut the query time by a factor of up to ten (table
creation included, and provided the resulting table isn't too big - I
make a COUNT check prior to creation so that I will eventually limit
manually the table size.). But of course, I'd rather have speed AND
accuracy ...
To make things worse, the tables tend to grow very quickly since as you
might have guessed, I am working on the database part of a logs
collector; the current implementation doesn't scale well along the data.
I hope this makes things clearer. Feel free to ask if you need more
clarifications, and thanks for your time.
Matthieu
Le 08/11/2010 22:26, Merlin Moncure a écrit :
On Mon, Nov 8, 2010 at 12:15 PM, Matthieu Huin<matthieu.huin@xxxxxxxxxx> wrote:
Greetings all,
I am trying to optimize SELECT queries on a large table (10M rows and more)
by using temporary tables that are subsets of my main table, thus narrowing
the search space to a more manageable size.
Is it possible to transfer indices (or at least use the information from
existing indices) from the big table to its subset in a reasonable amount of
time ?
Are you sure that the benefit of creating scratch tables is worth the
overhead? Can you give explain/analyze of the query you are trying
to optimize?
merlin
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general