Hello and Happy New Year, I'd like to validate or reject a performance-related PostgreSQL patch idea, regarding multiple strpos calls that operate on a common text input. From local testing using PGSQL 17.2, and searching the mailing lists, my understanding is that each function expression (including any that are duplicates) in a SQL query is evaluated independently. To check that, I constructed a pessimal demonstration (psql session snippet attached here for reference) by generating an extremely long random text blob ending with the phrase "known suffix". Running queries against that text locally I encountered the runtimes (annotated in the SQL comments) below: WHERE strpos(v, 'known') > 0; -- 2 seconds WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0; -- 4 seconds WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0; -- 6 seconds WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0; -- 8 seconds In other words: each additional strpos(value, ...) expression increased the evaluation time by a similar, significant duration of two seconds. This seems to confirm the basis that each expression is currently evaluated separately, by an independent read from the input text. I'd like to suggest the introduction of a documented multiple string matching algorithm[1], to yield results for each of multiple strpos calls while only reading through their common input text at-most once. In the context of considering writing a patch: would the complexity of implementing such a feature for PostgreSQL be worth the potential performance benefits? And either way, is there more I should learn about and consider? How would I provide convincing supporting evidence if I do write a patch? Thank you, James [1] Such as those described in "Flexible Pattern Matching in Strings", authored by G Navarro and M Raffinot, 2002, published by Cambridge University Press
testing=> CREATE EXTENSION pgcrypto; CREATE EXTENSION testing=> CREATE TEMPORARY TABLE test AS SELECT string_agg(gen_random_bytes(1024)::text, '') || 'known suffix' AS value FROM generate_series(1, 512*512); SELECT 1 testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Aggregate (cost=31.53..31.54 rows=1 width=8) (actual time=2113.529..2113.530 rows=1 loops=1) -> Seq Scan on test (cost=0.00..30.40 rows=453 width=0) (actual time=2086.105..2113.521 rows=1 loops=1) Filter: (strpos(value, 'known'::text) > 0) Planning Time: 0.205 ms Execution Time: 2113.600 ms (5 rows) testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0; QUERY PLAN ------------------------------------------------------------------------------------------------------------- Aggregate (cost=37.58..37.59 rows=1 width=8) (actual time=4196.635..4196.636 rows=1 loops=1) -> Seq Scan on test (cost=0.00..37.20 rows=151 width=0) (actual time=4140.200..4196.626 rows=1 loops=1) Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0)) Planning Time: 0.111 ms Execution Time: 4196.683 ms (5 rows) testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=44.38..44.39 rows=1 width=8) (actual time=6269.283..6269.284 rows=1 loops=1) -> Seq Scan on test (cost=0.00..44.00 rows=151 width=0) (actual time=6179.754..6269.275 rows=1 loops=1) Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0)) Planning Time: 0.101 ms Execution Time: 6269.329 ms (5 rows) testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=51.18..51.19 rows=1 width=8) (actual time=8190.171..8190.172 rows=1 loops=1) -> Seq Scan on test (cost=0.00..50.80 rows=151 width=0) (actual time=8072.736..8190.162 rows=1 loops=1) Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0)) Planning Time: 0.118 ms Execution Time: 8190.222 ms (5 rows)