Hi folks,

I've coded an auction crawler that queries several auction platforms for user-defined keywords. The results are put into queues and sorted out from there. For example, each auction query can be passed along with a maxprice value, which is stored in the result records. Every few minutes a filter moves the result records whose article has reached the maxprice to another queue. The blacklist filter (which is the one giving me headaches) moves the records whose article title matches certain regexes to another queue. One of the users has more than 630 blacklist entries, which combine into a single regex of about 24 kB.

I've tried different approaches: comparing against each single blacklist entry (joining the blacklist table), and comparing against one huge regex (putting the compiled regex per user into a separate table). Both run very, very long (> 15 min) and generate heavy load.

My current schema (reduced to the required stuff):

* article table:
    article_id(oid)
    title(text)
    current_price(float)
    ...

* user table:
    user_id(oid)
    username(text)
    ...

* user_results table:
    article_id(oid)
    user_id(oid)
    username(oid)
    queue(oid)        <-- only scanning 'incoming'
    seen(boolean)     <-- seen results are skipped
    ...

* user_db_list:
    username(text)
    dbname(text)
    data(text)
    ...

* heap:
    name(text)
    data(text)

This is the explain analyze output of the compiled-regex approach (the compiled regex is stored in the "heap" table):

auctionwatch=> explain analyze
    update base.user_results set queue = 'FOO'
     WHERE queue = 'incoming'
       AND NOT seen
       AND base.user_results.article_id = base.articles.inode_id
       AND base.articles.end_time > current_timestamp
       AND base.articles.title ~ (SELECT data FROM base.heap
                                   WHERE name = 'blacklist.title::'||base.user_results.username);

 Hash Join  (cost=2131.38..7622.69 rows=22 width=56) (actual time=1040416.087..1128977.760 rows=1 loops=1)
   Hash Cond: ("outer".article_id = "inner".inode_id)
   Join Filter: ("inner".title ~ (subplan))
   ->  Seq Scan on user_results  (cost=0.00..593.08 rows=11724 width=56) (actual time=0.104..518.036 rows=11189 loops=1)
         Filter: ((queue = 'incoming'::text) AND (NOT seen))
   ->  Hash  (cost=2014.41..2014.41 rows=8787 width=57) (actual time=250.946..250.946 rows=0 loops=1)
         ->  Seq Scan on articles  (cost=0.00..2014.41 rows=8787 width=57) (actual time=0.702..232.754 rows=8663 loops=1)
               Filter: (end_time > ('now'::text)::timestamp(6) with time zone)
   SubPlan
     ->  Seq Scan on heap  (cost=0.00..1.01 rows=1 width=32) (actual time=0.070..0.072 rows=1 loops=5998)
           Filter: (name = ('blacklist.title::'::text || $0))
 Total runtime: 1129938.362 ms
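In case it helps to see what I mean by "compiled regex": the per-user pattern in base.heap is basically all blacklist entries of that user glued together into one big alternation. A simplified sketch of the rebuild step (not my exact code; 'someuser' is just a placeholder, the heap row is assumed to exist already, and it relies on ARRAY(subquery) / array_to_string being available):

    -- collapse one user's blacklist entries into a single alternation
    -- and store it under the key the filter query looks up
    UPDATE base.heap
       SET data = '(' || array_to_string(ARRAY(
                      SELECT data
                        FROM base.user_db_list
                       WHERE username = 'someuser'        -- placeholder username
                         AND dbname   = 'BLACKLIST.TITLE'
                   ), ')|(') || ')'
     WHERE name = 'blacklist.title::someuser';

So for that user, the "title ~ ..." in the query above ends up matching every candidate article title against one ~24 kB pattern.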
And this is the approach via joining the regex table:

auctionwatch=> explain analyze
    update base.user_results set queue = 'FOO'
     WHERE queue = 'incoming'
       AND NOT seen
       AND base.user_results.article_id = base.articles.inode_id
       AND base.articles.end_time > current_timestamp
       AND base.articles.title ~ base.user_db_list.data
       AND base.user_db_list.username = base.user_results.username
       AND base.user_db_list.dbname = 'BLACKLIST.TITLE';

 Hash Join  (cost=3457.12..11044097.45 rows=3619812 width=56) (actual time=90458.408..126119.167 rows=2 loops=1)
   Hash Cond: ("outer".username = "inner".username)
   Join Filter: ("inner".title ~ "outer".data)
   ->  Seq Scan on user_db_list  (cost=0.00..5268.16 rows=186333 width=51) (actual time=512.939..514.394 rows=634 loops=1)
         Filter: (dbname = 'BLACKLIST.TITLE'::text)
   ->  Hash  (cost=3373.49..3373.49 rows=4254 width=109) (actual time=466.177..466.177 rows=0 loops=1)
         ->  Hash Join  (cost=2221.01..3373.49 rows=4254 width=109) (actual time=225.006..439.334 rows=6023 loops=1)
               Hash Cond: ("outer".article_id = "inner".inode_id)
               ->  Seq Scan on user_results  (cost=0.00..593.08 rows=11724 width=56) (actual time=0.155..85.865 rows=11223 loops=1)
                     Filter: ((queue = 'incoming'::text) AND (NOT seen))
               ->  Hash  (cost=2099.20..2099.20 rows=9127 width=57) (actual time=205.996..205.996 rows=0 loops=1)
                     ->  Seq Scan on articles  (cost=0.00..2099.20 rows=9127 width=57) (actual time=0.373..187.468 rows=8662 loops=1)
                           Filter: (end_time > ('now'::text)::timestamp(6) with time zone)
 Total runtime: 126921.854 ms

I'm not sure what "Total runtime" means: is it the time the analyze took, or the time the query will take to execute? If it really is the execution time, then the second query would be much faster (about 2 min vs. 18 min). But I still wonder why processing one huge regex is so dramatically slow.

BTW: in some tables I'm using the username instead of (or in parallel to) the numerical id, to skip joins against the user table. But I'm not sure whether this is wise for performance.

Any hints for further optimization appreciated :)

thx

--
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
        http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
        http://patches.metux.de/
---------------------------------------------------------------------