Re: Performance degradation in Index searches with special characters

Andrey Stikheev <andrey.stikheev@xxxxxxxxx> · Sun, 6 Oct 2024 20:28:29 +0300

Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the performance issue is indeed related to the default collation settings, particularly when handling certain special characters like < in the glibc strcoll_l function. This was confirmed during my testing on Debian 12  with glibc version 2.36 (this OS and glibc are being used in our office's Docker image: https://hub.docker.com/_/postgres).

My test database settings:

test_db=# \d+ test
                                                  Table "public.test"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 value  | character varying(10) |           | not null |         | extended |             |              | 
Indexes:
    "idx_test" btree (value)
Access method: heap

test_db=# \l test_db
                                                 List of databases
  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges 
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           | 
(1 row)

strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@715b19170a89:~# ldd --version
ldd (Debian GLIBC 2.36-9+deb12u8) 2.36
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@715b19170a89:~# cat test.c 
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <string.h>

int main() {
    char str1[] = "<";
    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator

    if (!str2) return 1;

    memset(str2, '<', 65535);
    str2[65535] = '\0';

    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);
    int result = strcoll_l(str1, str2, locale);

    printf("Comparison result: %d\n", result);

    freelocale(locale);
    free(str2);

    return 0;
}

root@715b19170a89:~# time ./test
Comparison result: -1

real	0m4.487s
user	0m4.483s
sys	0m0.003s

I'm considering switching to ICU collations, as they might handle this more efficiently. However, as I know ICU isn’t the default collation provider in PostgreSQL, and switching to it in a live environment isn’t a straightforward process. The main concern is that glibc’s default collation (en_US.UTF-8) is widely used, and this opens up the potential for a Denial of Service (DoS) attack. For instance, if user input includes long strings of repeated characters like <, it can severely degrade performance due to the extended processing time for string comparisons, especially in high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Andrey Stikheev <andrey.stikheev@xxxxxxxxx> writes:

>    - Changing the collation to 'C' in the query significantly improves

>    performance.

What collation are you using, pray tell?  (And what database encoding?) 

>    - Is this performance degradation expected due to collation handling of

>    certain special characters in PostgreSQL?

It seems like a performance bug indeed, but not ours --- I'm thinking

it must be in strcoll() or ICU.

Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see

something similar but not quite as bad:

u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);

 ?column? 

----------

(0 rows)

Time: 1383.033 ms (00:01.383)

Poking into it with gdb says that the time is indeed spent inside

strcoll:

#0  get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "", 

    table=0x7fabb490e2b0, weights=<optimized out>, 

    rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4, 

    seq=<synthetic pointer>) at strcoll_l.c:111

#1  __GI___strcoll_l (s1=0x1785878 "<", 

    s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)

    at strcoll_l.c:338

#2  0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1, 

    arg2=<optimized out>, len2=65536, locale=<optimized out>, 

    locale=<optimized out>) at pg_locale.c:1964

#3  0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1, 

    arg2=0x17958cc '<' <repeats 200 times>..., len2=65536, 

    collid=<optimized out>) at varlena.c:1567

#4  0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820

#5  0x00000000009d75fa in FunctionCall2Coll (

    flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>, 

    arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161

#6  0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0, 

    page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762

#7  0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0, 

    key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394

It's not the fault of the index machinery, because a single comparison

takes the same amount of time:

u8=# select '<' <= repeat('<', 65536);

 ?column? 

----------

 t

(1 row)

Time: 1391.550 ms (00:01.392)

I didn't try it with ICU.

                        regards, tom lane

-- 
Best regards,
Andrey Stikheev