Re: 60 core performance with 9.3


 



On 01/07/14 22:13, Andres Freund wrote:
On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
- cherry-picking the last 5 commits into the 9.4 branch, building a package
from that, and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+--------------------------
6       |  70189
12      | 128894
24      | 233542
48      | 422754
96      | 590796
192     | 630672

Wow - that is more like it! Andres, that is some nice work - we definitely owe
you some beers for that :-) I am aware that I need to retest with an
unpatched 9.4 src, as it is not clear from this data how much is due to
Andres's patches and how much to the steady stream of 9.4 development. I'll
post an update on that later, but figured this was interesting enough to
note for now.

Cool. That's what I like (and expect) to see :). I don't think unpatched
9.4 will show significantly different results than 9.3, but it'd be good
to validate that. If you do so, could you post the results in the
-hackers thread I just CCed you on? That'll help the work to get into
9.5.

So we seem to have nailed read-only performance. Going back and revisiting read-write performance finds:

Postgres 9.4 beta
rwlock patch
pgbench scale = 2000

max_connections = 200
shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 32MB
checkpoint_segments = 192
checkpoint_completion_target = 0.8
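
For anyone reproducing this, the runs were plain pgbench at scale 2000,
something like the following (a sketch rather than the exact invocation - the
10 minute duration, matching thread count and database name are assumptions):

  pgbench -i -s 2000 bench
  for c in 6 12 24 48 96 192; do
      pgbench -c $c -j $c -T 600 bench
  done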

clients  | tps (32 cores) | tps (60 cores)
---------+----------------+---------------
6        |   8313         |   8175
12       |  11012         |  14409
24       |  16151         |  17191
48       |  21153         |  23122
96       |  21977         |  22308
192      |  22917         |  23109


So for read-write we are back to not doing significantly better than 32 cores. Hmmm. Doing quite a few more tweaks gets some better numbers:

kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=5000000
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]

+checkpoint_segments = 1920
+wal_buffers = 256MB
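
(For anyone wanting to replicate the OS side, the kernel settings above were
applied roughly as follows - a sketch; persisting them via /etc/sysctl.conf is
left out, and the two postgres settings go in postgresql.conf as usual:)

  sysctl -w kernel.sched_autogroup_enabled=0
  sysctl -w kernel.sched_migration_cost_ns=5000000
  sysctl -w net.core.somaxconn=1024
  echo never > /sys/kernel/mm/transparent_hugepage/enabled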


clients  | tps
---------+---------
6        |   8366
12       |  15988
24       |  19828
48       |  30315
96       |  31649
192      |  29497

One more:

+wal_sync_method = open_datasync

clients  | tps
---------+---------
6        |  9566
12       | 17129
24       | 22962
48       | 34564
96       | 32584
192      | 28367

So this looks better - however I suspect 32 core performance would improve with these as well!
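
(If anyone wants to double-check which sync method a given wal device prefers,
pg_test_fsync gives a quick comparison - a sketch, with the test file path
being an assumption:)

  pg_test_fsync -f /path/to/xlog/test.out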

The problem does *not* look to be connected with IO (I will include some iostat below). So time to get the profiler out (192 clients for 1 minute):

Full report http://paste.ubuntu.com/7777886/
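
(For reference, the capture and report were roughly as below - a sketch
reconstructed from the cmdline recorded in the perf header; the 60 second
duration is an assumption:)

  perf record -ag -- sleep 60
  perf report --stdio > report.txt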

# ========
# captured on: Fri Jul 11 03:09:06 2014
# hostname : ncel-prod-db3
# os release : 3.13.0-24-generic
# perf version : 3.13.9
# arch : x86_64
# nrcpus online : 60
# nrcpus avail : 60
# cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,7
# total memory : 1056692116 kB
# cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, uncore_cbox_10 = 17, uncore_cbox_11 = 18, uncore_cbox_12 = 19, uncore_cbox_13 = 20, uncore_cbox_14 = 21, software = 1, uncore_irp = 33, uncore_pcu = 22, tracepoint = 2, uncore_imc_0 = 25, uncore_imc_1 = 26, uncore_imc_2 = 27, uncore_imc_3 = 28, uncore_imc_4 = 29, uncore_imc_5 = 30, uncore_imc_6 = 31, uncore_imc_7 = 32, uncore_qpi_0 = 34, uncore_qpi_1 = 35, uncore_qpi_2 = 36, uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, uncore_cbox_7 = 14, uncore_cbox_8 = 15, uncore_cbox_9 = 16, uncore_r2pcie = 37, uncore_r3qpi_0 = 38, uncore_r3qpi_1 = 39, breakpoint = 5, uncore_ha_0 = 23, uncore_ha_1 = 24, uncore_ubox = 6
# ========
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 359906321606
#
# Overhead  Command   Shared Object            Symbol
# ........  ........  .......................  .....................................................
#
8.82% postgres [kernel.kallsyms] [k] _raw_spin_lock_irqsave
                  |
                  --- _raw_spin_lock_irqsave
                     |
                     |--75.69%-- pagevec_lru_move_fn
                     |          __lru_cache_add
                     |          lru_cache_add
                     |          putback_lru_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          do_numa_page
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          |
                     |          |--31.07%-- PinBuffer
                     |          |          |
                     |          |           --100.00%-- ReadBuffer_common
                     |          |                     |
                      |          |                      --100.00%-- ReadBufferExtended
                      |          |                                |
                      |          |                                |--71.62%-- index_fetch_heap
                      |          |                                |          index_getnext
                      |          |                                |          IndexNext
                      |          |                                |          ExecScan
                      |          |                                |          ExecProcNode
                      |          |                                |          ExecModifyTable
                      |          |                                |          ExecProcNode
                      |          |                                |          standard_ExecutorRun
                      |          |                                |          ProcessQuery
                      |          |                                |          PortalRunMulti
                      |          |                                |          PortalRun
                      |          |                                |          PostgresMain
                      |          |                                |          ServerLoop
                      |          |                                |
                      |          |                                |--17.47%-- heap_hot_search
                      |          |                                |          _bt_check_unique
                      |          |                                |          _bt_doinsert
                      |          |                                |          btinsert
                      |          |                                |          FunctionCall6Coll
                      |          |                                |          index_insert
                      |          |                                |          |
                      |          |                                |           --100.00%-- ExecInsertIndexTuples
                      |          |                                |                     ExecModifyTable
                      |          |                                |                     ExecProcNode
                      |          |                                |                     standard_ExecutorRun
                      |          |                                |                     ProcessQuery
                      |          |                                |                     PortalRunMulti
                      |          |                                |                     PortalRun
                      |          |                                |                     PostgresMain
                      |          |                                |                     ServerLoop
                      |          |                                |
                      |          |                                |--3.81%-- RelationGetBufferForTuple
                      |          |                                |          heap_update
                      |          |                                |          ExecModifyTable
                      |          |                                |          ExecProcNode
                      |          |                                |          standard_ExecutorRun
                      |          |                                |          ProcessQuery
                      |          |                                |          PortalRunMulti
                      |          |                                |          PortalRun
                      |          |                                |          PostgresMain
                      |          |                                |          ServerLoop
                      |          |                                |
                      |          |                                |--3.65%-- _bt_relandgetbuf
                      |          |                                |          _bt_search
                      |          |                                |          _bt_first
                      |          |                                |          |
                      |          |                                |           --100.00%-- btgettuple
                      |          |                                |                     FunctionCall2Coll
                      |          |                                |                     index_getnext_tid
                      |          |                                |                     index_getnext
                      |          |                                |                     IndexNext
                      |          |                                |                     ExecScan
                      |          |                                |                     ExecProcNode
                      |          |                                |                     |
                      |          |                                |                     |--97.56%-- ExecModifyTable
                      |          |                                |                     |          ExecProcNode
                      |          |                                |                     |          standard_ExecutorRun
                      |          |                                |                     |          ProcessQuery
                      |          |                                |                     |          PortalRunMulti
                      |          |                                |                     |          PortalRun
                      |          |                                |                     |          PostgresMain
                      |          |                                |                     |          ServerLoop
                      |          |                                |                     |
                      |          |                                |                      --2.44%-- standard_ExecutorRun
                      |          |                                |                                PortalRunSelect
                      |          |                                |                                PortalRun
                      |          |                                |                                PostgresMain
                      |          |                                |                                ServerLoop
                      |          |                                |
                      |          |                                |--2.69%-- fsm_readbuf
                      |          |                                |          fsm_set_and_search
                      |          |                                |          RecordPageWithFreeSpace
                      |          |                                |          lazy_vacuum_rel
                      |          |                                |          vacuum_rel
                      |          |                                |          vacuum
                      |          |                                |          do_autovacuum
                      |          |                                |
                      |          |                                 --0.75%-- lazy_vacuum_rel
                      |          |                                           vacuum_rel
                      |          |                                           vacuum
                      |          |                                           do_autovacuum
                     |          |
                     |          |--4.66%-- SearchCatCache
                     |          |          |
                     |          |          |--49.62%-- oper
                     |          |          |          make_op
                     |          |          |          transformExprRecurse
                     |          |          |          transformExpr
                     |          |          |          |
                      |          |          |          |--90.02%-- transformTargetEntry
                      |          |          |          |          transformTargetList
                      |          |          |          |          transformStmt
                      |          |          |          |          parse_analyze
                      |          |          |          |          pg_analyze_and_rewrite
                      |          |          |          |          PostgresMain
                      |          |          |          |          ServerLoop
                     |          |          |          |
                      |          |          |           --9.98%-- transformWhereClause
                      |          |          |                     transformStmt
                      |          |          |                     parse_analyze
                      |          |          |                     pg_analyze_and_rewrite
                      |          |          |                     PostgresMain
                      |          |          |                     ServerLoop
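
So nearly all of that spinlock time is under do_numa_page -> migrate_pages,
i.e. the kernel's automatic NUMA balancing migrating pages as the backends
fault them in. A quick way to check whether that is implicated would be to
toggle it off at runtime and re-run (a sketch - I have not verified how much
this moves the numbers here):

  sysctl -w kernel.numa_balancing=0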



With respect to IO, here are typical iostat outputs:

sda: HW RAID 10 array of SAS SSDs [data]
md0: SW RAID 10 of nvme[0-3]n1 PCIe SSDs [xlog]
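
(The outputs were captured with iostat in extended, megabyte mode, i.e.
something like the following - the 1 second interval is an assumption:)

  iostat -xm 1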

Non Checkpoint

Device:  rrqm/s  wrqm/s    r/s       w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   15.00   0.00      3.00   0.00    0.07    50.67     0.00   0.00    0.00    0.00   0.00   0.00
nvme0n1    0.00    0.00   0.00   4198.00   0.00  146.50    71.47     0.18   0.05    0.00    0.05   0.04  18.40
nvme1n1    0.00    0.00   0.00   4198.00   0.00  146.50    71.47     0.18   0.04    0.00    0.04   0.04  17.20
nvme2n1    0.00    0.00   0.00   4126.00   0.00  146.08    72.51     0.15   0.04    0.00    0.04   0.03  14.00
nvme3n1    0.00    0.00   0.00   4125.00   0.00  146.03    72.50     0.15   0.04    0.00    0.04   0.03  14.40
md0        0.00    0.00   0.00  16022.00   0.00  292.53    37.39     0.00   0.00    0.00    0.00   0.00   0.00
dm-0       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-1       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-2       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-3       0.00    0.00   0.00     18.00   0.00    0.07     8.44     0.00   0.00    0.00    0.00   0.00   0.00
dm-4       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00


Checkpoint

Device:  rrqm/s  wrqm/s    r/s       w/s  rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   29.00   1.00  96795.00   0.00 1074.52    22.73   133.13   1.38    4.00    1.38   0.01 100.00
nvme0n1    0.00    0.00   0.00   3564.00   0.00   56.71    32.59     0.12   0.03    0.00    0.03   0.03  11.60
nvme1n1    0.00    0.00   0.00   3564.00   0.00   56.71    32.59     0.12   0.03    0.00    0.03   0.03  12.00
nvme2n1    0.00    0.00   0.00   3884.00   0.00   59.12    31.17     0.14   0.04    0.00    0.04   0.04  13.60
nvme3n1    0.00    0.00   0.00   3884.00   0.00   59.12    31.17     0.13   0.03    0.00    0.03   0.03  12.80
md0        0.00    0.00   0.00  14779.00   0.00  115.80    16.05     0.00   0.00    0.00    0.00   0.00   0.00
dm-0       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-1       0.00    0.00   0.00      3.00   0.00    0.01     8.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-2       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-3       0.00    0.00   1.00  96830.00   0.00 1074.83    22.73   134.79   1.38    4.00    1.38   0.01 100.00
dm-4       0.00    0.00   0.00      0.00   0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00


Thanks for your patience if you have read this far!

Regards

Mark



