On Tue, Nov 9, 2010 at 4:47 AM, Matthieu Huin <matthieu.huin@xxxxxxxxxx> wrote:
> Hello Merlin,
>
> So far the improvement in responsiveness has been very noticeable, even
> without indexing the temporary tables. Of course, this is just trading
> accuracy for speed, as I simply narrow the search space arbitrarily ...
>
> The schema I am working on is close to the one I reference in this
> thread: http://archives.postgresql.org/pgsql-general/2010-11/msg00272.php
>
> Since we want to implement full text search and tag querying, this can lead
> to rather complex autogenerated queries such as this one (find log lines
> containing the word 'root', dated from 11/04 to 11/06, where the 'program'
> tag is sshd and the 'severity_code' tag is less than 3):
>
> EXPLAIN ANALYZE SELECT r.logid FROM (((SELECT lcond84.logid, lcond84.date FROM
>     ( SELECT tmp84.logid, tmp84.date FROM logs tmp84
>       WHERE TRUE
>         AND ( to_tsvector('simple',tmp84.body) @@ plainto_tsquery('simple','root') )
>         AND tmp84.date > '2010-11-04 10:22:06.26'
>         AND tmp84.date < '2010-11-06 10:22:06.26' ) AS lcond84
>     NATURAL JOIN
>     ( SELECT tmp85.logid FROM tags tmp85
>       WHERE FALSE
>         OR (tmp85.name='severity_code' AND num_lt(tmp85.value, 3) )
>         OR (tmp85.name='program' AND tmp85.value = CAST(ROW('sshd') AS tagvalue) )
>       GROUP BY tmp85.logid HAVING COUNT(tmp85.logid) = 2 ) AS tcond84 )
> )) AS r ORDER BY r.date DESC LIMIT 1000;
>
> giving the following query plan:
>
> Limit  (cost=765445.54..765445.56 rows=9 width=16) (actual time=34744.257..34744.257 rows=0 loops=1)
>   ->  Sort  (cost=765445.54..765445.56 rows=9 width=16) (actual time=34744.255..34744.255 rows=0 loops=1)
>         Sort Key: tmp84.date
>         Sort Method: quicksort  Memory: 17kB
>         ->  Hash Join  (cost=765005.46..765445.40 rows=9 width=16) (actual time=34744.202..34744.202 rows=0 loops=1)
>               Hash Cond: (tmp85.logid = tmp84.logid)
>               ->  HashAggregate  (cost=758440.29..758669.77 rows=15299 width=8) (actual time=33343.816..33343.816 rows=0 loops=1)
>                     Filter: (count(tmp85.logid) = 2)
>                     ->  Bitmap Heap Scan on tags tmp85  (cost=92363.26..757225.45 rows=242968 width=8) (actual time=20676.354..33294.252 rows=32864 loops=1)
>                           Recheck Cond: ((name = 'severity_code'::text) OR (name = 'program'::text))
>                           Filter: (((name = 'severity_code'::text) AND num_lt(value, 3::double precision)) OR ((name = 'program'::text) AND ((value).storedvalue = 'sshd'::text)))
>                           ->  BitmapOr  (cost=92363.26..92363.26 rows=2148625 width=0) (actual time=19688.915..19688.915 rows=0 loops=1)
>                                 ->  Bitmap Index Scan on nameval_idx  (cost=0.00..30388.35 rows=707841 width=0) (actual time=19337.358..19337.358 rows=708719 loops=1)
>                                       Index Cond: (name = 'severity_code'::text)
>                                 ->  Bitmap Index Scan on nameval_idx  (cost=0.00..61853.42 rows=1440784 width=0) (actual time=351.551..351.551 rows=1484703 loops=1)
>                                       Index Cond: (name = 'program'::text)
>               ->  Hash  (cost=6553.06..6553.06 rows=969 width=16) (actual time=1400.378..1400.378 rows=32516 loops=1)
>                     ->  Bitmap Heap Scan on logs tmp84  (cost=215.51..6553.06 rows=969 width=16) (actual time=1104.226..1383.745 rows=32516 loops=1)
>                           Recheck Cond: (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
>                           Filter: ((date > '2010-11-04 10:22:06.26+01'::timestamp with time zone) AND (date < '2010-11-06 10:22:06.26+01'::timestamp with time zone))
>                           ->  Bitmap Index Scan on fulltext_body_idx  (cost=0.00..215.27 rows=1740 width=0) (actual time=1097.874..1097.874 rows=64340 loops=1)
>                                 Index Cond: (to_tsvector('simple'::regconfig, body) @@ '''root'''::tsquery)
> Total runtime: 34756.938 ms
>
> This one isn't too bad, but the runtime seems to increase exponentially with
> the table size. Using a temporary table based on the date condition can
> therefore cut the query time by a factor of up to ten (table creation
> included, provided the resulting table isn't too big - I run a COUNT check
> prior to creation so that I can limit the table size manually if needed).
> But of course, I'd rather have speed AND accuracy ...
>
> To make things worse, the tables tend to grow very quickly since, as you
> might have guessed, I am working on the database part of a logs collector;
> the current implementation doesn't scale well as the data grows.
>
> I hope this makes things clearer. Feel free to ask if you need more
> clarification, and thanks for your time.

How are you partitioning the tags? Is the partitioned query doing the same
job as the non-partitioned query? Is date a forced criterion? (And if it is,
have you considered date partitioning / brute force?)

merlin
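
For illustration, a minimal sketch of the temporary-table workaround Matthieu
describes might look like the following, assuming the logs/tags schema from
the quoted query. The logs_window name and the idea of a separate guard query
are illustrative, not the poster's actual implementation:

-- 1. Guard check: how many rows would the date window materialize?
--    (The acceptable threshold is application-defined.)
SELECT COUNT(*) FROM logs
WHERE date > '2010-11-04 10:22:06.26'
  AND date < '2010-11-06 10:22:06.26';

-- 2. If the count is acceptable, materialize the window.
CREATE TEMPORARY TABLE logs_window AS
SELECT logid, date, body
FROM logs
WHERE date > '2010-11-04 10:22:06.26'
  AND date < '2010-11-06 10:22:06.26';

-- 3. Run the expensive full-text and tag predicates against the small table.
SELECT w.logid
FROM logs_window w
WHERE to_tsvector('simple', w.body) @@ plainto_tsquery('simple', 'root')
  AND w.logid IN (
        SELECT t.logid FROM tags t
        WHERE (t.name = 'severity_code' AND num_lt(t.value, 3))
           OR (t.name = 'program' AND t.value = CAST(ROW('sshd') AS tagvalue))
        GROUP BY t.logid
        HAVING COUNT(t.logid) = 2 )
ORDER BY w.date DESC
LIMIT 1000;

This trades the cost of one sequential pass (the CTAS) for much cheaper
full-text and tag scans afterwards, which matches the factor-of-ten speedup
reported above.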
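
And a sketch of the date partitioning Merlin raises, using the
inheritance-based scheme available in PostgreSQL releases of the time.
Partition names, the daily granularity, and the index names are assumptions;
inserts would additionally need a routing trigger or application-side logic:

-- One child table per day, constrained by a CHECK on the date column.
CREATE TABLE logs_2010_11_04 (
    CHECK (date >= '2010-11-04' AND date < '2010-11-05')
) INHERITS (logs);

CREATE TABLE logs_2010_11_05 (
    CHECK (date >= '2010-11-05' AND date < '2010-11-06')
) INHERITS (logs);

-- Each partition carries its own (small) full-text index.
CREATE INDEX logs_2010_11_04_fts_idx ON logs_2010_11_04
    USING gin (to_tsvector('simple', body));
CREATE INDEX logs_2010_11_05_fts_idx ON logs_2010_11_05
    USING gin (to_tsvector('simple', body));

-- With constraint exclusion enabled, a date-qualified query scans only the
-- partitions whose CHECK constraints can satisfy the predicate.
SET constraint_exclusion = partition;  -- use 'on' in releases before 8.4
SELECT logid FROM logs
WHERE date > '2010-11-04 10:22:06.26'
  AND date < '2010-11-06 10:22:06.26'
  AND to_tsvector('simple', body) @@ plainto_tsquery('simple', 'root')
ORDER BY date DESC
LIMIT 1000;

The effect is similar to the temporary-table trick, but without giving up
accuracy: rows outside the date window are excluded by the planner rather
than by an arbitrary cutoff.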