> It is usually not acceptable to run applications with synchronous_commit=off, so once you have identified that the bottleneck is in implementing synchronous_commit=on, you probably need to take a deep dive into your hardware to figure out why it isn't performing the way you need/want/expect it to. Tuning the server under synchronous_commit=off when you don't intend to run your production server with that setting is unlikely to be fruitful.
I do not intend to run the server with synchronous_commit=off, but based on my limited knowledge I'm wondering whether all of these observations are related and caused by the same underlying bottleneck (or misconfiguration):
1) At higher concurrency levels, TPS with synchronous_commit=off is lower with optimised settings than with default settings
2) At ALL concurrency levels, TPS with synchronous_commit=on is lower with optimised settings (irrespective of the shared_buffers value) than with default settings
3) At higher concurrency levels, optimised + synchronous_commit=on + shared_buffers=2G gives HIGHER TPS than optimised + synchronous_commit=off + shared_buffers=8G
Here are the (completely counter-intuitive) numbers for these observations:
+--------+-----------------------------------------------------------------+------------------------+
| | synchronous_commit=on | synchronous_commit=off |
+--------+-----------------------------------------------------------------+------------------------+
| client | Mostly defaults [1] | Optimised [2] | Optimised [2] | Optimised [2] |
| | | + shared_buffers=2G | + shared_buffers=8G | + shared_buffers=8G |
+--------+---------------------+---------------------+---------------------+------------------------+
| 1 | 80-86 | 74-77 | 75-75 | 169-180 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 6 | 350-376 | 301-304 | 295-300 | 1265-1397 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 12 | 603-619 | 476-488 | 485-493 | 1746-2352 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 24 | 947-1015 | 678-739 | 723-770 | 1869-2518 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 48 | 1435-1512 | 950-1043 | 1029-1086 | 1912-2818 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 96 | 1769-1811 | 3337-3459 | 1302-1346 | 1546-1753 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 192 | 1857-1992 | 3613-3715 | 1269-1345 | 1332-1508 |
+--------+---------------------+---------------------+---------------------+------------------------+
| 384 | 1667-1793 | 3180-3300 | 1262-1364 | 1356-1450 |
+--------+---------------------+---------------------+---------------------+------------------------+
> In case you do intend to run with synchronous_commit=off, or if you are just curious: running with a very high number of active connections often reveals subtle bottlenecks and interactions, and is very dependent on your hardware. Unless you actually intend to run your server with synchronous_commit=off and with a large number of active connections, it is probably not worth investigating this.
Please see the table above. I'm digging deeper into this because of observation (2), i.e. I have not been able to come up with any optimised settings that perform better than the default settings at the concurrency levels I care about (100-150).
> I'm more interested in the low end; you should do much better than those reported numbers when clients=1 and synchronous_commit=off, with the data on an SSD. I think you said that pgbench is running on a different machine than the database, so perhaps it is just network overhead that is keeping this value down. What happens if you run them on the same machine?
I'm currently running this, and the early numbers are surprising. For clients=1, the numbers with optimised settings + shared_buffers=2G are:
-- pgbench run over a 1Gbps network: 74-77 tps
-- pgbench run on the same machine: 152-153 tps (is this absolute number good enough given my hardware?)
Is the 1 Gbps network the bottleneck? Does it explain the three observations given above? I'll wait for the current set of benchmarks to finish, then re-run them over the network and monitor network utilisation.
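A rough back-of-the-envelope on those clients=1 numbers: ~75 tps over the network is ~13.3 ms per transaction, versus ~6.5 ms for ~152 tps locally, so roughly 6-7 ms per transaction disappears when the network is taken out. If each of the handful of statements in the default pgbench script is a separate client/server round trip, that gap corresponds to round-trip latencies on the order of 1 ms, which is easy to measure directly. A minimal sketch of the comparison (the host, user and database names below are placeholders, not my actual setup):

# raw round-trip latency from the pgbench machine to the database host
ping -c 20 db-host

# clients=1 over the network (run from the pgbench machine)
pgbench -h db-host -U postgres -c 1 -j 1 -T 60 pgbench_db

# clients=1 on the database machine itself, via the local socket
pgbench -c 1 -j 1 -T 60 pgbench_db

At clients=1 the run is almost certainly latency-bound rather than bandwidth-bound (each statement moves only a few hundred bytes), so the ping numbers plus the local run should show how much of the gap the network accounts for.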
[1] "Mostly default" settings are whatever ships with Ubuntu 18.04 + PG 11. A snippet of the relevant setts are given below:
max_connections=400
work_mem=4MB
maintenance_work_mem=64MB
shared_buffers=128MB
temp_buffers=8MB
effective_cache_size=4GB
wal_buffers=-1
wal_sync_method=fsync
max_wal_size=1GB
autovacuum=off # Auto-vacuuming was disabled
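If it helps, the values actually in effect (and whether they come from the config file or are compiled-in defaults) can be read back from pg_settings; a quick sketch, trimmed to the names listed here (the database name is a placeholder):

# show the effective values and where each one came from
psql -d pgbench_db -c "SELECT name, setting, unit, source
                       FROM pg_settings
                       WHERE name IN ('max_connections', 'shared_buffers', 'work_mem',
                                      'maintenance_work_mem', 'effective_cache_size',
                                      'wal_buffers', 'wal_sync_method', 'max_wal_size',
                                      'synchronous_commit', 'autovacuum');"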
[2] Optimised settings (a sketch of the corresponding pgbench invocation follows the list):
max_connections = 400
shared_buffers = 8GB # or 2GB -- depending upon which scenario was being evaluated
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 3495kB
min_wal_size = 1GB
max_wal_size = 2GB
max_worker_processes = 12
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
autovacuum=off # Auto-vacuuming was disabled
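For completeness, the general shape of one benchmark run per table cell is sketched below; the scale factor, duration and database/host names are placeholders, not the values actually used for the numbers above:

# one-off initialisation of the test database (scale factor is a placeholder)
pgbench -i -s 100 pgbench_db

# one cell of the table, e.g. 96 clients over the network
pgbench -h db-host -c 96 -j 12 -T 300 -P 10 pgbench_db

# for the synchronous_commit=off columns, the GUC is flipped on the server side,
# e.g. synchronous_commit = off in postgresql.conf, followed by a reload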