The optimizer was a bit too clever. It used the same plan for the LEFT JOIN. But that put me on the right track. I tried a LATERAL join. But the optimizer saw through that too and used the same plan. So I tried a materialized CTE and that finally forced it to use a different plan. That made it run in ~70ms -- about 18x faster. Thanks!
explain analyze
with r as materialized (
select * from matching_rules
where id >= 0 and id < 60
)
select r.id, i.id
from r
join items i on i.name ~ r.name_matches
;
with r as materialized (
select * from matching_rules
where id >= 0 and id < 60
)
select r.id, i.id
from r
join items i on i.name ~ r.name_matches
;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Nested Loop (cost=2.78..714.20 rows=230 width=8) (actual time=0.071..69.545 rows=702 loops=1)
Join Filter: (i.name ~ r.name_matches)
Rows Removed by Join Filter: 45298
CTE r
-> Seq Scan on matching_rules (cost=0.00..2.78 rows=46 width=26) (actual time=0.007..0.047 rows=46 loops=1)
Filter: ((id >= 0) AND (id < 60))
Rows Removed by Filter: 6
-> CTE Scan on r (cost=0.00..0.92 rows=46 width=36) (actual time=0.008..0.090 rows=46 loops=1)
-> Materialize (cost=0.00..23.00 rows=1000 width=27) (actual time=0.000..0.081 rows=1000 loops=46)
-> Seq Scan on items i (cost=0.00..18.00 rows=1000 width=27) (actual time=0.003..0.092 rows=1000 loops=1)
Planning Time: 0.206 ms
Execution Time: 69.633 ms
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Nested Loop (cost=2.78..714.20 rows=230 width=8) (actual time=0.071..69.545 rows=702 loops=1)
Join Filter: (i.name ~ r.name_matches)
Rows Removed by Join Filter: 45298
CTE r
-> Seq Scan on matching_rules (cost=0.00..2.78 rows=46 width=26) (actual time=0.007..0.047 rows=46 loops=1)
Filter: ((id >= 0) AND (id < 60))
Rows Removed by Filter: 6
-> CTE Scan on r (cost=0.00..0.92 rows=46 width=36) (actual time=0.008..0.090 rows=46 loops=1)
-> Materialize (cost=0.00..23.00 rows=1000 width=27) (actual time=0.000..0.081 rows=1000 loops=46)
-> Seq Scan on items i (cost=0.00..18.00 rows=1000 width=27) (actual time=0.003..0.092 rows=1000 loops=1)
Planning Time: 0.206 ms
Execution Time: 69.633 ms
On Wed, Aug 25, 2021 at 4:05 PM Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
On Wed, Aug 25, 2021 at 11:47:43AM -0500, Jack Christensen wrote:
> I have items that need to be categorized by user defined matching rules.
> Trusted users can create rules that include regular expressions. I've
> reduced the problem to this example.
> I use the following query to find matches:
>
> select r.id, i.id
> from items i
> join matching_rules r on i.name ~ r.name_matches;
>
> When there are few rules the query runs quickly. But as the number of rules
> increases the runtime often increases at a greater than linear rate.
Maybe it's because the REs are cached by RE_compile_and_cache(), but if you
loop over the REs in the inner loop, then the caching is ineffecive.
Maybe you can force it to join with REs on the outer loop by writing it as:
| rules LEFT JOIN items WHERE rules.id IS NOT NULL,
..to improve performance, or at least test that theory.
--
Justin