Re: looking for a faster way to do that

Eduardo Morras wrote:

>> >
>> >Hi,
>> >
>> >if I understand this right, it does not mean "check if the string 
>> >appears at position 0"
>> >which could translate into an index query, but rather "check if the 
>> >string appears anywhere
>> >and then check if that is position 0", so the entire table is checked.
>> 
>> The second one, yes: since it checks all the patterns you want only 
>> once per row, it needs only a single table scan. The first one 
>> eliminates the substring 'ABC' from the string; if the lengths of 
>> both strings are equal, the substring 'ABC' wasn't in it, and if 
>> they are different, the trimmed string will be shorter.
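
If I follow, that replace/length variant would amount to something like the
rough, untested sketch below, using the same items and n tables as in the
plan further down; a row would qualify when stripping wantcode out of code
actually shortens it:

-- one scan over items x n; a shorter length means wantcode occurred in code
select items.num, n.wantcode
  from items, n
 where length(replace(items.code, n.wantcode, '')) <> length(items.code);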
>> 
>> >explain analyze select items.num, wantcode from items, n where 
>> >strpos(code, wantcode) = 0;
>> >  Nested Loop  (cost=167.14..196066.54 rows=39178 width=36) (actual 
>> > time=0.074..36639.312 rows=7832539 loops=1)
>> >    Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
>> >    ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual 
>> > time=0.005..2.212 rows=815 loops=1)
>> >    ->  Materialize  (cost=167.14..263.28 rows=9614 width=42) 
>> > (actual time=0.007..13.970 rows=9614 loops=815)
>> >          ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 
>> > width=42) (actual time=0.044..14.855 rows=9614 loops=1)
>> >  Total runtime: 46229.836 ms
>> >
>> >
>> >The query ran much faster than the pattern query, however. This 
>> >seems to come down to searching for a plain string vs. initializing 
>> >the regex engine every time (for 815 queries in a test set).
>> 
>> It will do only one table scan, while your original code will do one 
>> for each substring you want to test, so you can add more and more 
>> substrings without much extra cost. If you want to use the regex 
>> engine instead of the PostgreSQL string functions, check 
>> regexp_matches(); it should be faster if you have 3000 substrings.
>> 
>> select * from items where regexp_matches(items.code,'(ABC) (DE1) 
>> (any_substring)') <> '{}';
>> 

Hi Eduardo,

It is clear that scanning the table once with a list of patterns will outperform
rescanning the table for every string wanted. Now, my problem is that the patterns
are dynamic as well. So if I could translate a table with one column and a few
thousand rows into something like
regexp_matches(code,'string1|string2|.....string2781')
that would ideally give a performant query. Unfortunately I am not sure how to
achieve this transformation inside the database, and doing it externally fails
because a single query can only be so many characters long.
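
Perhaps something along these lines would do it; this is a rough, untested
sketch, assuming the pattern table is n(wantcode) as in the plan above, that
string_agg() (9.0 or later) is available, and that the patterns contain no
regexp metacharacters (otherwise they would need escaping first):

-- build the alternation inside the query and scan items only once
select i.num, i.code
  from items i
 where i.code ~ (select string_agg(wantcode, '|') from n);

or, sidestepping the regex engine entirely:

-- one scan over items, probing the pattern list per row
select i.num, i.code
  from items i
 where exists (select 1 from n where strpos(i.code, n.wantcode) > 0);

I do not know yet how either of these would compare to the single big regexp.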

Regards
Wolfgang Hamann



