Re: Performance degradation in Index searches with special characters

Hi Andrey, 

I have tried my best to answer your queries below: 


### Performance Degradation with Special Characters in PostgreSQL

#### **Explanation**:
The performance degradation you're seeing with special characters such as `<`, `@`, `#`, etc. is most likely down to how PostgreSQL handles **collations**. By default, string comparisons use the database's locale-aware collation (e.g. `en_US.UTF-8`, provided by libc or ICU), which maps characters to collation weights and compares them across several levels rather than byte by byte. Punctuation and symbol characters can be handled differently from alphabetic ones under such locales, and with very long strings that extra work becomes the dominant cost of the comparison.
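
If it helps to confirm which collation is actually in effect, the database-level and column-level settings can be inspected directly (a small sketch; table and column names follow your example):

```sql
-- Database-level locale settings
SELECT datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Column-level collation (NULL means the column uses the database default)
SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'test' AND column_name = 'value';
```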

#### **Key Observations**:
1. **Alphabetic vs. Special Characters**:
   - Queries with alphabetic characters perform significantly faster than those with special characters.
   - This is due to differences in how the collation handles comparisons for alphabetic and non-alphabetic characters. Special characters often require more computational resources for comparison, resulting in longer execution times.

2. **Impact of Collation**:
   - When you switched the collation to **"C"**, the performance improved substantially. This is because the `"C"` collation uses **byte-wise** comparisons rather than locale-aware sorting, which simplifies the comparison logic, especially for special characters.

3. **String Length Impact**:
   - As the string length increases, execution time grows much faster than linearly for the special-character strings: in the timings you posted it roughly quadruples each time the length doubles (77 ms → 325 ms → 1155 ms → 4490 ms), i.e. roughly quadratic rather than exponential growth, because each comparison pushes the entire string through the locale-aware collation. (See the timing sketch right after this list.)
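
The sketch below just restates the doubling-length measurements from your own session; run it in psql with `\timing` enabled to reproduce the pattern:

```sql
-- Run in psql with \timing on; the length doubles each time.
SELECT 1 FROM test WHERE value = repeat('@', 8192);   -- ~77 ms in your run
SELECT 1 FROM test WHERE value = repeat('@', 16384);  -- ~325 ms
SELECT 1 FROM test WHERE value = repeat('@', 32768);  -- ~1155 ms
SELECT 1 FROM test WHERE value = repeat('@', 65536);  -- ~4490 ms
```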

#### **Recommendations**:

1. **Use the "C" Collation**:
   - The most effective improvement you observed came from using the `"C"` collation, which performs simple byte-by-byte comparisons. You can apply the `"C"` collation at the query level or alter the column definition:
     ```sql
     SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
     ```
     or
     ```sql
     ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
     ```
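
   If you go the ALTER route, comparisons on that column then use `"C"` by default, so the original query no longer needs an explicit COLLATE clause (sketch, same names as above); note that indexes depending on the column are rebuilt as part of the change, which can take a while on a large table:

     ```sql
     -- After the column is declared COLLATE "C", this comparison is
     -- byte-wise without any COLLATE clause in the query itself:
     SELECT 1 FROM test WHERE value = repeat('<', 65536);
     ```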

2. **Index with an Explicit Collation**:
   - Build an additional index declared with the `"C"` collation (strictly speaking a collated index rather than a partial index). This lets PostgreSQL use the cheap byte-wise comparison for lookups written with that collation, without touching the column definition:
     ```sql
     CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
     ```
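
   For the planner to consider this index, the comparison in the query has to be made under the same collation, so pair it with the query-level `COLLATE "C"` form (sketch, names from the example above):

     ```sql
     -- idx_test_special_chars is declared COLLATE "C", so it matches
     -- comparisons that are also written under collation "C":
     EXPLAIN (ANALYZE, BUFFERS)
     SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
     ```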

3. **Reduce String Length**:
   - If possible, reduce the string length for comparisons involving special characters. The performance impact grows with longer strings due to the nature of locale-aware collation.

4. **Use Expression (Functional) Indexes**:
   - You can index an expression that transforms the value before comparison (e.g. normalizing or hashing it). Such an index is only used when the query applies exactly the same expression:
     ```sql
     CREATE INDEX idx_test_transformed ON test (lower(value));
     ```
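
   If the only lookups you need are exact matches on these long values, another expression-index pattern (not something from your thread, just a sketch) is to index a hash of the value so the index comparison never goes through the collation at all:

     ```sql
     -- Hypothetical hash-based equality lookup: the index stores md5(value),
     -- and the query compares short, collation-independent hashes first.
     CREATE INDEX idx_test_value_md5 ON test (md5(value));

     SELECT 1
     FROM test
     WHERE md5(value) = md5(repeat('<', 65536))
       AND value = repeat('<', 65536);  -- recheck guards against hash collisions
     ```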

5. **Optimize Queries**:
   - For repeated queries involving long strings with special characters, consider caching results or using a materialized view so the expensive collation-aware comparisons are pre-computed once rather than on every lookup; a sketch follows.
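
   A minimal sketch of the materialized-view idea (hypothetical names; adapt the predicate to the lookups you actually repeat):

     ```sql
     -- Precompute the subset of rows that the expensive queries touch,
     -- then query the (smaller) materialized view instead of the base table.
     CREATE MATERIALIZED VIEW test_special AS
     SELECT value FROM test WHERE value LIKE '%<%' OR value LIKE '%@%';

     CREATE INDEX ON test_special (value COLLATE "C");

     -- Re-run whenever the base table changes:
     REFRESH MATERIALIZED VIEW test_special;
     ```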

6. **Custom Collations**:
   - If you need locale-aware sorting but can compromise on certain aspects, consider creating a **custom collation** that simplifies special character handling.
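
   A minimal sketch of defining and applying a named collation; here it is simply a copy of `"C"`, while an ICU-based collation with tailored rules is also possible if you need more nuanced behavior:

     ```sql
     -- Create a named collation that behaves like "C" (byte-wise comparison)
     CREATE COLLATION bytewise FROM "C";

     -- Attach it to the column, or use it per query with a COLLATE clause
     ALTER TABLE test ALTER COLUMN value TYPE VARCHAR(10) COLLATE bytewise;
     ```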

#### **Answers to Your Questions**:
1. **Is the performance degradation expected due to collation handling of certain special characters?**
   - Yes, this behavior is expected. Locale-aware collations can be complex for special characters, leading to longer comparison times.

2. **Are there any recommendations to improve performance without changing the column or database collation?**
   - Without changing the collation entirely, you can use:
     - **Query-level collation adjustments** (using the `"C"` collation in specific queries).
     - **Partial indexes** with the `"C"` collation.
     - **Functional indexes** or optimizations like materialized views.

On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev <andrey.stikheev@xxxxxxxxx> wrote:

Dear PostgreSQL Community,

I am facing significant performance issues when executing queries that involve string comparisons with special characters, such as <, #, !, @, etc., especially when dealing with long strings. The query execution time increases drastically when these characters are used, whereas queries with alphabetic characters do not show such degradation. This behavior is observed both on macOS (using the official postgres:17 image via Docker) and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.

Here is a minimal example:

testdb=# SELECT version();
                                                          version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)

testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
CREATE TABLE
Time: 3.562 ms

testdb=# CREATE INDEX idx_test ON test (value);
CREATE INDEX
Time: 3.080 ms

testdb=# INSERT INTO test (value) VALUES ('<');
INSERT 0 1
Time: 3.365 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
 ?column?
----------
(0 rows)
Time: 4454.535 ms (00:04.455)

testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
 ?column?
----------
(0 rows)
Time: 3.772 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
 ?column?
----------
(0 rows)
Time: 4.352 ms


Time: 9.503 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
 ?column?
----------
(0 rows)
Time: 3.299 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
 ?column?
----------
(0 rows)
Time: 77.171 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
 ?column?
----------
(0 rows)
Time: 325.190 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
 ?column?
----------
(0 rows)
Time: 1154.850 ms (00:01.155)

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
 ?column?
----------
(0 rows)
Time: 4490.206 ms (00:04.490)


testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal) SELECT 1 FROM test WHERE value = repeat('<', 65000);

---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
   Output: 1
   Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...
   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
   Buffers: shared hit=1
   ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
         Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
         Buffers: shared hit=1
 Planning Time: 1.082 ms
 Execution Time: 4425.602 ms
(9 rows)

Time: 4433.001 ms (00:04.433)

Observations:

  • The performance degradation occurs with certain special characters like < , !,  >, @ , #, ... .
  • Queries using alphabetic characters or appending/prepending characters execute much faster.
  • The execution time increases exponentially with the length of the string composed of special characters.
  • Changing the collation to 'C' in the query significantly improves performance.

Questions:

  • Is this performance degradation expected due to collation handling of certain special characters in PostgreSQL?
  • Are there any recommendations to improve performance without changing the column or database collation?
--
Best regards,
Andrey Stikheev


--

Best Regards 
Shiv Iyer 
