Performance degradation in Index searches with special characters

Andrey Stikheev <andrey.stikheev@xxxxxxxxx> · Sat, 5 Oct 2024 12:23:56 +0300

Dear PostgreSQL Community,
I am facing significant performance issues when executing queries that involve string comparisons with special characters, such as <, #, !, @, etc., especially when dealing with long strings. The query execution time increases drastically when these characters are used, whereas queries with alphabetic characters do not show such degradation. This behavior is observed both on macOS (using the official postgres:17 image via Docker) and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
Here is a minimal example:

testdb=# SELECT version();
                                                          version                                                          
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)

testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
CREATE TABLE
Time: 3.562 ms

testdb=# CREATE INDEX idx_test ON test (value);
CREATE INDEX
Time: 3.080 ms

testdb=# INSERT INTO test (value) VALUES ('<');
INSERT 0 1
Time: 3.365 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
 ?column? 
----------
(0 rows)
Time: 4454.535 ms (00:04.455)

testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
 ?column? 
----------
(0 rows)
Time: 3.772 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
 ?column? 
----------
(0 rows)
Time: 4.352 ms

Time: 9.503 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
 ?column? 
----------
(0 rows)

Time: 3.299 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
 ?column? 
----------
(0 rows)

Time: 77.171 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
 ?column? 
----------
(0 rows)

Time: 325.190 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
 ?column? 
----------
(0 rows)

Time: 1154.850 ms (00:01.155)
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
 ?column? 
----------
(0 rows)

Time: 4490.206 ms (00:04.490)

testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal) SELECT 1 FROM test WHERE value = repeat('<', 65000);

---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
   Output: 1
   Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...
   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
   Buffers: shared hit=1
   ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
         Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
         Buffers: shared hit=1
 Planning Time: 1.082 ms
 Execution Time: 4425.602 ms
(9 rows)

Time: 4433.001 ms (00:04.433)

Observations:
The performance degradation occurs with certain special characters like < , !,  >, @ , #, ... .
Queries using alphabetic characters or appending/prepending characters execute much faster.
The execution time increases exponentially with the length of the string composed of special characters.
Changing the collation to 'C' in the query significantly improves performance.
Questions:
Is this performance degradation expected due to collation handling of certain special characters in PostgreSQL?
Are there any recommendations to improve performance without changing the column or database collation?
-- 
Best regards,
Andrey Stikheev