Hi guys,
I've been configuring a new server and tuning PostgreSQL 15.3, but I'm struggling with latency that I consistently see on the new server when running fast, short queries, compared to our other server.
We're running two different versions of PostgreSQL:
- Server A: PostgreSQL 9.3
- Server B: PostgreSQL 15.3
Server B is the new server and is way more powerful than server A:
- Server A: 1x Intel Xeon E3-1270 3.5GHz, 2x 8GB DDR3, RAID0
- Server B: 2x Intel Xeon Platinum 8260 2.4GHz, 4x 16GB DDR4, RAID1
Server B runs Ubuntu Linux 20.04, and I've been tweaking settings in both Linux and PostgreSQL 15.3. With the current setup, PostgreSQL 15.3 can process more than 1 million transactions per second with pgbench:
# pgbench --username postgres --select-only --client 100 --jobs 10 --time 20 test
pgbench (15.3 (Ubuntu 15.3-1.pgdg20.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 1
query mode: simple
number of clients: 100
number of threads: 10
maximum number of tries: 1
duration: 20 s
number of transactions actually processed: 23039950
number of failed transactions: 0 (0.000%)
latency average = 0.087 ms
initial connection time = 62.536 ms
tps = 1155053.135317 (without initial connection time)
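As a quick sanity check on those figures (a back-of-the-envelope sketch using only the numbers from the pgbench output above), throughput and average latency should be consistent with each other:

```python
# Sanity-check the pgbench output above: with N clients, the reported
# average latency should be roughly clients / tps.

transactions = 23_039_950        # "number of transactions actually processed"
duration_s = 20                  # "duration: 20 s"
clients = 100                    # "number of clients"
tps_reported = 1_155_053.135317  # tps without initial connection time

tps_wall_clock = transactions / duration_s
latency_ms = clients / tps_reported * 1000

print(f"wall-clock tps: {tps_wall_clock:.0f}")  # slightly below reported tps
print(f"derived latency: {latency_ms:.3f} ms")  # matches "latency average = 0.087 ms"
```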
As pgbench shows, throughput is great. When testing individual queries, heavy queries (those taking a few ms) also run faster on server B than on server A. Unfortunately, for fast, short SELECT queries (< 1 ms), server A consistently beats server B, even when the query plans are equivalent:
Server A:
# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..8.44 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Output: (1)
Buffers: shared hit=5
-> Index Only Scan using foobar_pkey on public.foobar (cost=0.42..8.44 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Output: 1
Index Cond: (foobar.id = 1)
Heap Fetches: 1
Buffers: shared hit=5
Total runtime: 0.017 ms
(9 rows)
Time: 0.281 ms
Server B:
# EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT 1 AS "a" FROM "foobar" WHERE ("foobar"."id" = 1) LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.11 rows=1 width=4) (actual time=0.019..0.021 rows=1 loops=1)
Output: 1
Buffers: shared hit=4
-> Index Only Scan using foobar_pkey on public.foobar (cost=0.00..1.11 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=1)
Output: 1
Index Cond: (foobar.id = 1)
Heap Fetches: 0
Buffers: shared hit=4
Planning Time: 0.110 ms
Execution Time: 0.045 ms
(10 rows)
Time: 0.635 ms
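One thing worth keeping in mind when comparing those timings: psql's `Time:` line includes client-side overhead (protocol round trip, parsing, output formatting), while `Execution Time` is measured inside the server. A rough decomposition of the numbers as printed above (a sketch only, and note that 9.3's `Total runtime` does not include planning, so the two "overhead" figures aren't strictly comparable):

```python
# Split psql's "Time:" into server-side work vs. everything else
# (round trip, parse, output formatting). Figures are the ones printed above.

# Server B (PostgreSQL 15.3): planning and execution are reported separately.
b_total_ms, b_planning_ms, b_execution_ms = 0.635, 0.110, 0.045
b_overhead_ms = b_total_ms - (b_planning_ms + b_execution_ms)

# Server A (PostgreSQL 9.3): only "Total runtime" is reported (no planning time).
a_total_ms, a_runtime_ms = 0.281, 0.017
a_overhead_ms = a_total_ms - a_runtime_ms

print(f"server A overhead: {a_overhead_ms:.3f} ms")  # ~0.264 ms
print(f"server B overhead: {b_overhead_ms:.3f} ms")  # ~0.480 ms
```

If most of the gap really is outside `Execution Time`, that points at connection/protocol overhead or planning rather than the executor itself.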
RAID1 could add latency on server B if it were reading from disk, but I've confirmed that these queries hit the buffer cache, so the data is read from memory, not from disk. I checked the hit rate with the following query:
SELECT 'cache hit rate' AS name, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio FROM pg_statio_user_tables;
The hit rate was over 95%, and it increased as soon as I ran those queries. The index hit rate behaved the same way.
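For reference, the ratio in that query is just hits / (hits + reads) over the cumulative counters in pg_statio_user_tables. A small sketch of the same arithmetic (the counter values are hypothetical, not from either server):

```python
# Cache hit ratio as computed by the SQL above: heap_blks_hit over
# heap_blks_hit + heap_blks_read, from cumulative per-table counters.

def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared buffers."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0

# Hypothetical counters: 9,500 buffer hits vs. 500 reads -> 95% hit rate.
print(cache_hit_ratio(9_500, 500))  # 0.95
```

One caveat: `heap_blks_read` counts blocks read from outside shared buffers, which may still have been served from the OS page cache rather than the disk.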
I've played with several PostgreSQL parameters, increasing/decreasing the number of workers, shared_buffers, work_mem, JIT, the cpu_*_cost variables, etc., but nothing helped reduce that latency.
Here are the settings I'm currently using with PostgreSQL 15.3 after a lot of experimenting with different values:
checkpoint_completion_target = 0.9
checkpoint_timeout = 900
cpu_index_tuple_cost = 0.00001
cpu_operator_cost = 0.00001
effective_cache_size = 12GB
effective_io_concurrency = 200
jit = off
listen_addresses = 'localhost'
maintenance_work_mem = 1GB
max_connections = 100
max_parallel_maintenance_workers = 4
max_parallel_workers = 12
max_parallel_workers_per_gather = 4
max_wal_size = 4GB
max_worker_processes = 12
min_wal_size = 1GB
random_page_cost = 1.1
shared_buffers = 4GB
ssl = off
timezone = 'UTC'
wal_buffers = 16MB
work_mem = 64MB
Some notes about those settings:
- We're running other services on this server; that's why I'm not using more resources.
- Tweaking the cpu_*_cost parameters was crucial to getting a better query plan. With the default values, PostgreSQL consistently chose a slower plan.
I've been looking at some settings in Linux as well:
- Swappiness is set to the lowest safe value: vm.swappiness = 1
- Huge pages are not in use, and Transparent Huge Pages (THP) is set to 'madvise'. PostgreSQL 15.3 uses the default value for the 'huge_pages' parameter: 'try'.
- The memory overcommit policy is set to 1: vm.overcommit_memory = 1
I've also experimented with huge pages, trying to force PostgreSQL to use them. I manually allocated the number of huge pages reported by this query:
SHOW shared_memory_size_in_huge_pages;
I confirmed that PostgreSQL was using huge pages, but unfortunately I saw no improvement in latency or throughput, so I reverted to the previous state.
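For the record, the allocation is basically: take the total shared memory size, divide by the huge page size (2 MB by default on x86-64), and round up; that's the figure shared_memory_size_in_huge_pages reports and what vm.nr_hugepages needs to cover. A sketch of the arithmetic (the 4.2 GB total is a hypothetical value for 4 GB of shared_buffers plus other shared memory, not the actual number from server B):

```python
import math

# Huge pages needed for PostgreSQL's shared memory segment, i.e. roughly
# what SHOW shared_memory_size_in_huge_pages reports and what vm.nr_hugepages
# must be set to. 2 MB is the default huge page size on x86-64 Linux.

HUGE_PAGE_BYTES = 2 * 1024 * 1024

def huge_pages_needed(shared_mem_bytes: int) -> int:
    return math.ceil(shared_mem_bytes / HUGE_PAGE_BYTES)

# Hypothetical: 4 GB of shared_buffers plus ~0.2 GB of other shared memory.
shared_mem = int(4.2 * 1024**3)
print(huge_pages_needed(shared_mem))  # ~2151 pages
```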
Conclusion:
The latency is quite low on both servers, but when you run dozens or hundreds of fast, short queries concurrently, the difference adds up: on aggregate, server A ends up 0.1-1.0 seconds faster than server B.
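To make the aggregate effect concrete (a back-of-the-envelope sketch using the two psql timings above; real workloads and concurrency will vary):

```python
# Aggregate cost of the per-query latency gap, using the psql "Time:" values
# shown earlier: 0.281 ms on server A vs. 0.635 ms on server B.

a_ms, b_ms = 0.281, 0.635
delta_ms = b_ms - a_ms  # ~0.354 ms extra per query on server B

for n_queries in (500, 1_000, 2_500):
    print(f"{n_queries} queries: ~{n_queries * delta_ms / 1000:.2f} s slower")
```

A few hundred to a few thousand such queries is enough to land in the 0.1-1.0 second range described above.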
As you can see, server B has two CPU sockets, so Linux is running with NUMA, and its clock speed is lower than server A's. Could either of those be causing that latency?
Any suggestions or ideas where to look? I'd really appreciate your help.
Thank you