> It is usually not acceptable to run applications with synchronous_commit=off, so once you have identified that the bottleneck is in implementing synchronous_commit=on, you probably need to take a deep dive into your hardware to figure out why it isn't performing the way you need/want/expect it to. Tuning the server under synchronous_commit=off when you don't intend to run your production server with that setting is unlikely to be fruitful.
I do not intend to run the server with synchronous_commit=off, but based on my limited knowledge I'm wondering whether all of these observations are related and caused by the same underlying bottleneck (or misconfiguration):
1) At higher concurrency levels, TPS with synchronous_commit=off is lower with optimised settings than with default settings
2) At ALL concurrency levels, TPS with synchronous_commit=on is lower with optimised settings (irrespective of the shared_buffers value) than with default settings
3) At higher concurrency levels, optimised + synchronous_commit=on + shared_buffers=2G gives HIGHER TPS than optimised + synchronous_commit=off + shared_buffers=8G
Here are the (completely counter-intuitive) numbers for these observations:
+--------+-----------------------------------------------------------------+------------------------+
| | synchronous_commit=on | synchronous_commit=off |
+--------+-----------------------------------------------------------------+------------------------+
| client | Mostly defaults [1] | Optimised [2] | Optimised [2] | Optimised [2] |
| | | + shared_buffers=2G | + shared_buffers=8G | + shared_buffers=8G |
+--------+---------------------+---------------------+---------------------+------------------------+
| 1 | 80-86 | 74-77 | 75-75 | 169-180 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 6 | 350-376 | 301-304 | 295-300 | 1265-1397 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 12 | 603-619 | 476-488 | 485-493 | 1746-2352 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 24 | 947-1015 | 678-739 | 723-770 | 1869-2518 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 48 | 1435-1512 | 950-1043 | 1029-1086 | 1912-2818 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 96 | 1769-1811 | 3337-3459 | 1302-1346 | 1546-1753 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 192 | 1857-1992 | 3613-3715 | 1269-1345 | 1332-1508 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 384 | 1667-1793 | 3180-3300 | 1262-1364 | 1356-1450 |
+--------+---------------------+---------------------+---------------------+------------------------+
> In case you do intend to run with synchronous_commit=off, or if you are just curious: running with a very high number of active connections often reveals subtle bottlenecks and interactions, and is very dependent on your hardware. Unless you actually intend to run your server with synchronous_commit=off and with a large number of active connections, it is probably not worth investigating this.
Please see the table above. I'm digging deeper into this because of observation (2), i.e. I have not been able to come up with any optimised settings that perform better than the default settings at the concurrency levels I care about (100-150).
> I'm more interested in the low end; you should do much better than those reported numbers when clients=1 and synchronous_commit=off, with the data on an SSD. I think you said that pgbench is running on a different machine than the database, so perhaps it is just network overhead that is keeping this value down. What happens if you run them on the same machine?
I'm currently running this, and the early numbers are surprising. For clients=1, the numbers with optimised settings + shared_buffers=2G are:
-- pgbench run over a 1Gbps network: 74-77 tps
-- pgbench run on the same machine: 152-153 tps (is this absolute number good enough given my hardware?)
Is the 1 Gbps network the bottleneck? Does it explain the three observations given above? I'll wait for the current set of benchmarks to finish, then re-run them over the network and monitor network utilisation.
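A rough back-of-the-envelope on those clients=1 numbers: ~75 tps over the network is ~13.3 ms per transaction, versus ~6.5 ms for ~152 tps locally, so roughly 6-7 ms per transaction disappears when the network is taken out. If each of the handful of statements in the default pgbench script is a separate client/server round trip, that gap corresponds to round-trip latencies on the order of 1 ms, which is easy to measure directly. A minimal sketch of the comparison (the host, user and database names below are placeholders, not my actual setup):

# raw round-trip latency from the pgbench machine to the database host
ping -c 20 db-host

# clients=1 over the network (run from the pgbench machine)
pgbench -h db-host -U postgres -c 1 -j 1 -T 60 pgbench_db

# clients=1 on the database machine itself, via the local socket
pgbench -c 1 -j 1 -T 60 pgbench_db

At clients=1 the run is almost certainly latency-bound rather than bandwidth-bound (each statement moves only a few hundred bytes), so the ping numbers plus the local run should show how much of the gap the network accounts for.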
[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11. A snippet of the relevant setts are given below:
max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=128MB
temp_buffers=8MB
effective_cache_size=4GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=1GB
autovacuum=off # Auto-vacuuming was disabled
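If it helps, the values actually in effect (and whether they come from the config file or are compiled-in defaults) can be read back from pg_settings; a quick sketch, trimmed to the names listed here (the database name is a placeholder):

# show the effective values and where each one came from
psql -d pgbench_db -c "SELECT name, setting, unit, source
                       FROM pg_settings
                       WHERE name IN ('max_connections', 'shared_buffers', 'work_mem',
                                      'maintenance_work_mem', 'effective_cache_size',
                                      'wal_buffers', 'wal_sync_method', 'max_wal_size',
                                      'synchronous_commit', 'autovacuum');"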
[2] Optimised settings (a sketch of the corresponding pgbench invocation follows the list):
max_connections = 400
shared_buffers = 8GB # or 2GB -- depending upon which scenario was being evaluated
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 3495kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 12
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
autovacuum=off # Auto-vacuuming was disabled
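For completeness, the general shape of one benchmark run per table cell is sketched below; the scale factor, duration and database/host names are placeholders, not the values actually used for the numbers above:

# one-off initialisation of the test database (scale factor is a placeholder)
pgbench -i -s 100 pgbench_db

# one cell of the table, e.g. 96 clients over the network
pgbench -h db-host -c 96 -j 12 -T 300 -P 10 pgbench_db

# for the synchronous_commit=off columns, the GUC is flipped on the server side,
# e.g. synchronous_commit = off in postgresql.conf, followed by a reload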