Question: consolidating strpos searches?

James Addison <jay@xxxxxxxxxxxxxx> · Sat, 4 Jan 2025 17:16:06 +0000

Hello and Happy New Year,

I'd like to validate or reject a performance-related PostgreSQL patch
idea, regarding multiple strpos calls that operate on a common text
input.

From local testing using PGSQL 17.2, and searching the mailing lists,
my understanding is that each function expression (including any that
are duplicates) in a SQL query is evaluated independently.

To check that, I constructed a pessimal demonstration (psql session
snippet attached here for reference) by generating an extremely long
random text blob ending with the phrase "known suffix".

Running queries against that text locally I encountered the runtimes
(annotated in the SQL comments) below:

  WHERE strpos(v, 'known') > 0; -- 2 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0; -- 4 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0; -- 6 seconds
  WHERE strpos(v, 'known') > 0 AND strpos(v, 'suffix') > 0 AND
strpos(v, 'suffix') > 0 AND strpos(v, 'suffix') > 0; -- 8 seconds

In other words: each additional strpos(value, ...) expression
increased the evaluation time by a similar, significant duration of
two seconds.  This seems to confirm the basis that each expression is
currently evaluated separately, by an independent read from the input
text.

I'd like to suggest the introduction of a documented multiple string
matching algorithm[1], to yield results for each of multiple strpos
calls while only reading through their common input text at-most once.

In the context of considering writing a patch: would the complexity of
implementing such a feature for PostgreSQL be worth the potential
performance benefits?  And either way, is there more I should learn
about and consider?  How would I provide convincing supporting
evidence if I do write a patch?

Thank you,
James

[1] Such as those described in "Flexible Pattern Matching in Strings",
authored by G Navarro and M Raffinot, 2002, published by Cambridge
University Press
testing=> CREATE EXTENSION pgcrypto;
CREATE EXTENSION
testing=> CREATE TEMPORARY TABLE test AS SELECT string_agg(gen_random_bytes(1024)::text, '') || 'known suffix' AS value FROM generate_series(1, 512*512);
SELECT 1
testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=31.53..31.54 rows=1 width=8) (actual time=2113.529..2113.530 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..30.40 rows=453 width=0) (actual time=2086.105..2113.521 rows=1 loops=1)
         Filter: (strpos(value, 'known'::text) > 0)
 Planning Time: 0.205 ms
 Execution Time: 2113.600 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0;
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=37.58..37.59 rows=1 width=8) (actual time=4196.635..4196.636 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..37.20 rows=151 width=0) (actual time=4140.200..4196.626 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0))
 Planning Time: 0.111 ms
 Execution Time: 4196.683 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44.38..44.39 rows=1 width=8) (actual time=6269.283..6269.284 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..44.00 rows=151 width=0) (actual time=6179.754..6269.275 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0))
 Planning Time: 0.101 ms
 Execution Time: 6269.329 ms
(5 rows)

testing=> EXPLAIN ANALYZE SELECT COUNT(*) FROM test WHERE strpos(value, 'known') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0 AND strpos(value, 'suffix') > 0;
                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=51.18..51.19 rows=1 width=8) (actual time=8190.171..8190.172 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..50.80 rows=151 width=0) (actual time=8072.736..8190.162 rows=1 loops=1)
         Filter: ((strpos(value, 'known'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0) AND (strpos(value, 'suffix'::text) > 0))
 Planning Time: 0.118 ms
 Execution Time: 8190.222 ms
(5 rows)