Implementing a regex filter

Hi folks,


I've coded an auction crawler, which queries several auction
platforms for user-defined keywords. The results are then
put into queues and sorted out from there. For example, each
auction query can be passed along with a maxprice value,
which is put into the result records. Every few minutes a
filter moves those result records where the article has reached
the maxprice to another queue.
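
Roughly, that filter is just an UPDATE along these lines (a
simplified sketch; the max_price column and the target queue
name are illustrative, not my real schema):

   -- Simplified sketch of the maxprice filter; max_price and
   -- 'maxprice_reached' are made up for illustration:
   UPDATE base.user_results
      SET queue = 'maxprice_reached'
     FROM base.articles
    WHERE base.user_results.queue = 'incoming'
      AND base.user_results.article_id = base.articles.inode_id
      AND base.articles.current_price >= base.user_results.max_price;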

The blacklist filter (which gives me headaches) moves those
records where the article's title matches some regexes to another
queue. One of the users has more than 630 blacklist entries,
combined into one regex of about 24 kB. I've tried different
approaches: comparing against each single blacklist entry (joining
the blacklist table), and comparing against one huge regex (the
compiled regexes are stored per user in a separate table). Both run
very, very long (> 15 mins) and cause heavy load.
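
The huge per-user regex is built by OR-ing all of a user's entries
together, roughly like this (a sketch; string_agg() only exists in
newer PostgreSQL releases, older ones need a custom aggregate or
client-side concatenation):

   -- Sketch: collapse one user's blacklist entries into a single
   -- alternation regex. string_agg() is an assumption here; on old
   -- releases the string has to be built in the crawler instead.
   INSERT INTO base.heap (name, data)
   SELECT 'blacklist.title::' || username,
          '(' || string_agg(data, ')|(') || ')'
     FROM base.user_db_list
    WHERE dbname = 'BLACKLIST.TITLE'
    GROUP BY username;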

My current scheme (reduced to required stuff):

* article table:      article_id(oid)
                      title(text)
                      current_price(float)
                      ...

* user table:         user_id(oid)
                      username(text)
                      ...

* user_results table: article_id(oid)
                      user_id(oid)
                      username(text)
                      queue(text)       <-- only scanning 'incoming'
                      seen(boolean)     <-- seen results are skipped
                      ...

* user_db_list:       username(text)
                      dbname(text)
                      data(text)
                      ...

* heap:               name(text)
                      data(text)

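In DDL form, that's roughly the following (a sketch; types taken
from the annotations above, keys and the remaining columns omitted):

   -- Rough DDL sketch of the tables above; details assumed.
   CREATE TABLE base.articles (
       inode_id      oid,          -- called article_id above
       title         text,
       current_price float,
       end_time      timestamptz   -- referenced by the queries below
   );

   CREATE TABLE base.users (
       user_id  oid,
       username text
   );

   CREATE TABLE base.user_results (
       article_id oid,
       user_id    oid,
       username   text,
       queue      text,            -- only 'incoming' is scanned
       seen       boolean          -- seen results are skipped
   );

   CREATE TABLE base.user_db_list (
       username text,
       dbname   text,
       data     text
   );

   CREATE TABLE base.heap (
       name text,
       data text
   );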

This is the explain analyze output of the compiled-regex approach
(the compiled regex is stored in the "heap" table):

auctionwatch=> explain analyze update base.user_results set queue='FOO'
               WHERE queue = 'incoming' AND
                     NOT seen AND
                     base.user_results.article_id = base.articles.inode_id AND
                     base.articles.end_time > current_timestamp AND
                     base.articles.title ~ (
                        SELECT data FROM base.heap WHERE name =
                            'blacklist.title::'||base.user_results.username);

Hash Join  (cost=2131.38..7622.69 rows=22 width=56) (actual time=1040416.087..1128977.760 rows=1 loops=1)
   Hash Cond: ("outer".article_id = "inner".inode_id)
   Join Filter: ("inner".title ~ (subplan))
   ->  Seq Scan on user_results  (cost=0.00..593.08 rows=11724 width=56) (actual time=0.104..518.036 rows=11189 loops=1)
         Filter: ((queue = 'incoming'::text) AND (NOT seen))
   ->  Hash  (cost=2014.41..2014.41 rows=8787 width=57) (actual time=250.946..250.946 rows=0 loops=1)
         ->  Seq Scan on articles  (cost=0.00..2014.41 rows=8787 width=57) (actual time=0.702..232.754 rows=8663 loops=1)
               Filter: (end_time > ('now'::text)::timestamp(6) with time zone)
   SubPlan
     ->  Seq Scan on heap  (cost=0.00..1.01 rows=1 width=32) (actual time=0.070..0.072 rows=1 loops=5998)
           Filter: (name = ('blacklist.title::'::text || $0))

Total runtime: 1129938.362 ms

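One thing that strikes me in this plan: the subquery on "heap" is
executed 5998 times, once per candidate row. Fetching the regex
through a plain join instead should avoid that (untested sketch):

   -- Untested sketch: fetch the per-user regex via a plain join
   -- instead of the correlated subquery (which runs 5998 times above).
   UPDATE base.user_results
      SET queue = 'FOO'
     FROM base.articles, base.heap
    WHERE base.user_results.queue = 'incoming'
      AND NOT base.user_results.seen
      AND base.user_results.article_id = base.articles.inode_id
      AND base.articles.end_time > current_timestamp
      AND base.heap.name = 'blacklist.title::' || base.user_results.username
      AND base.articles.title ~ base.heap.data;
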
And the approach via joining the regex table:

auctionwatch=> explain analyze update base.user_results set queue = 'FOO'
               WHERE queue = 'incoming' AND
                     NOT seen AND
                     base.user_results.article_id = base.articles.inode_id AND
                     base.articles.end_time > current_timestamp AND
                     base.articles.title ~ base.user_db_list.data AND
                     base.user_db_list.username = base.user_results.username AND
                     base.user_db_list.dbname = 'BLACKLIST.TITLE';

Hash Join  (cost=3457.12..11044097.45 rows=3619812 width=56) (actual time=90458.408..126119.167 rows=2 loops=1)
   Hash Cond: ("outer".username = "inner".username)
   Join Filter: ("inner".title ~ "outer".data)
   ->  Seq Scan on user_db_list  (cost=0.00..5268.16 rows=186333 width=51) (actual time=512.939..514.394 rows=634 loops=1)
         Filter: (dbname = 'BLACKLIST.TITLE'::text)
   ->  Hash  (cost=3373.49..3373.49 rows=4254 width=109) (actual time=466.177..466.177 rows=0 loops=1)
         ->  Hash Join  (cost=2221.01..3373.49 rows=4254 width=109) (actual time=225.006..439.334 rows=6023 loops=1)
               Hash Cond: ("outer".article_id = "inner".inode_id)
               ->  Seq Scan on user_results  (cost=0.00..593.08 rows=11724 width=56) (actual time=0.155..85.865 rows=11223 loops=1)
                     Filter: ((queue = 'incoming'::text) AND (NOT seen))
               ->  Hash  (cost=2099.20..2099.20 rows=9127 width=57) (actual time=205.996..205.996 rows=0 loops=1)
                     ->  Seq Scan on articles  (cost=0.00..2099.20 rows=9127 width=57) (actual time=0.373..187.468 rows=8662 loops=1)
                           Filter: (end_time > ('now'::text)::timestamp(6) with time zone)

Total runtime: 126921.854 ms
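
What also strikes me here: the planner expects 186333 rows from
user_db_list but only 634 come back, so the statistics are probably
stale. Things I could still try (index names are made up):

   -- Hypothetical helpers; names are made up. The partial indexes
   -- match the WHERE clauses of the filter queries above.
   ANALYZE base.user_db_list;   -- estimate above is off by ~300x

   CREATE INDEX user_db_list_blacklist_idx
       ON base.user_db_list (username)
       WHERE dbname = 'BLACKLIST.TITLE';

   CREATE INDEX user_results_incoming_idx
       ON base.user_results (article_id)
       WHERE queue = 'incoming' AND NOT seen;
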
I'm not sure what "Total runtime" means. Is it the time the
EXPLAIN ANALYZE itself took, or the time the query will take to
execute?

If it's really the execution time, then the second query would be
much faster (about 2 mins vs. 18 mins). But I really wonder why
processing one huge regex is so dramatically slow.


BTW: in some tables I'm using the username instead of (or parallel
to) the numerical id, to skip joins against the user table. But
I'm not sure whether this is wise for performance.


Any hints for further optimization appreciated :)


thx
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------
