RE: PG 9.6 Slow inserts with long-lasting LWLocks


 



Hi,

 

Well, unfortunately I still need community help.

 

-- Environment

OS: CentOS Linux release 7.2.1511

Kernel:  3.10.0-327.36.3.el7.x86_64

PostgreSQL: 9.6.3

-- Hardware

Server: Dell PowerEdge R430

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Raid controller: PERC H730 Mini (1GB cache)

Disks: 8 x 10K RPM SAS 12Gb/s 2.5" (ST1200MM0088) in RAID 6

RAM: 192GB (M393A2G40DB0-CPB x 16)

For more detailed hardware info please see attached configuration.txt

-- postgresql.conf

max_connections = 2048

shared_buffers = 48GB

temp_buffers = 128MB

work_mem = 256MB

maintenance_work_mem = 512MB

dynamic_shared_memory_type = posix

wal_level = hot_standby

min_wal_size = 4GB

max_wal_size = 32GB

huge_pages = on

In addition, PostgreSQL is started under numactl --interleave=all.
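
A minimal sketch of how that is applied, assuming pg_ctl is invoked directly (the real init script differs) and with the data directory path as a placeholder:

# start the postmaster with memory interleaved across both NUMA nodes
# (/pgdata is a placeholder for the real data directory)
numactl --interleave=all pg_ctl -D /pgdata start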

-- sysctl.conf

kernel.shmmax=64424509440

kernel.shmall=4294967296

kernel.sem = 1024 32767 128 16384

fs.aio-max-nr=3145728

fs.file-max = 6815744

net.core.rmem_default=262144

net.core.rmem_max=4194304

net.core.wmem_default=262144

net.core.wmem_max=1048586

vm.nr_hugepages=33000

vm.dirty_background_bytes=67108864

vm.dirty_bytes=536870912

vm.min_free_kbytes=1048576

vm.zone_reclaim_mode=0

 

Again: the problem is occasional long inserts that can happen 1-5 times per day on an OLTP system.

No autovacuum runs during the long inserts. The WAL rate is 1-2 GB per hour; no correlation with this issue has been spotted.

The wait event "buffer_mapping" is seen for the affected transactions, but not every time (perhaps it just isn't caught every time).
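
These wait events come from periodic pg_stat_activity sampling. A simplified sketch of the kind of query being polled (PostgreSQL 9.6 column names; the actual snapshot tool differs):

# simplified sketch of the polled snapshot query
psql -c "SELECT now() AS snapshot_time, pid, state,
                now() - xact_start  AS trx_duration,
                now() - query_start AS query_duration,
                wait_event_type, wait_event, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE state <> 'idle';"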

I have two suspects for such behaviour: I/O system and high concurrency.

There is a problem with one application that frequently recreates up to 90 sessions, but investigation shows no direct correlation between those sessions and the long transactions; at the very least it is not the root cause of the issue (of course, that application behaviour will be fixed).

 

The investigation, and tracing with strace in particular, showed that:

1. The only long event traced from the postgres backends was <... semop resumed>.

2. It seems the whole host hangs during such events.
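
The backends were traced by attaching strace to their pids. Judging by the timestamps and per-call durations in the output below, the invocation was roughly the following (exact flags may have differed; <backend_pid> is a placeholder):

# -f follow children, -tt wall-clock timestamps with microseconds,
# -T time spent in each syscall; output collected in /tmp/strace
strace -f -tt -T -p <backend_pid> -o /tmp/strace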

 

Example:

The Java application, located on a separate host, reports several long transactions:

123336.943 - [1239588mks]: event.insert-table

123336.943 - [1240827mks]: event.insert-table

123337.019 - [1292534mks]: event.insert-table

143353.542 - [5467657mks]: event.insert-table

143353.543 - [5468884mks]: event.insert-table

152338.763 - [1264588mks]: event.insert-table

152338.765 - [2054887mks]: event.insert-table

 

Strace output for the event that happened at 14:33, for a particular known pid:

119971 14:33:48.075375 epoll_wait(3,  <unfinished ...>

119971 14:33:48.075696 <... epoll_wait resumed> {{EPOLLIN, {u32=27532016, u64=27532016}}}, 1, -1) = 1 <0.000313>

119971 14:33:48.075792 recvfrom(9,  <unfinished ...>

119971 14:33:48.075866 <... recvfrom resumed> "B\0\0\3\27\0S_21\0\0*\0\1\0\1\0\1\0\0\0\0\0\1\0\1\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 807 <0.000066>

119971 14:33:48.076243 semop(26706044, {{8, -1, 0}}, 1 <unfinished ...>

120019 14:33:48.119971 recvfrom(9,  <unfinished ...>

119971 14:33:53.491029 <... semop resumed> ) = 0 <5.414772>

119971 14:33:53.500356 lseek(18, 0, SEEK_END <unfinished ...>

119971 14:33:53.500436 <... lseek resumed> ) = 107790336 <0.000072>

119971 14:33:53.500514 lseek(20, 0, SEEK_END <unfinished ...>

 

Checking the strace output for long semop calls across the whole day:

root@host [20180314 17:47:36]:/home/user$ egrep " <[1-9]." /tmp/strace | grep semop

119991 12:33:36 <... semop resumed> )   = 0 <1.419394>

119942 12:33:36 <... semop resumed> )   = 0 <1.422554>

119930 12:33:36 <... semop resumed> )   = 0 <1.414916>

119988 12:33:36 <... semop resumed> )   = 0 <1.213309>

119966 12:33:36 <... semop resumed> )   = 0 <1.237492>

119958 14:33:53.489398 <... semop resumed> ) = 0 <5.455830>

120019 14:33:53.490613 <... semop resumed> ) = 0 <5.284505>

119997 14:33:53.490638 <... semop resumed> ) = 0 <5.111661>

120000 14:33:53.490649 <... semop resumed> ) = 0 <3.521992>

119991 14:33:53.490660 <... semop resumed> ) = 0 <2.522460>

119988 14:33:53.490670 <... semop resumed> ) = 0 <5.252485>

120044 14:33:53.490834 <... semop resumed> ) = 0 <1.718129>

119976 14:33:53.490852 <... semop resumed> ) = 0 <2.489563>

119974 14:33:53.490862 <... semop resumed> ) = 0 <1.520801>

119984 14:33:53.491011 <... semop resumed> ) = 0 <1.213411>

119971 14:33:53.491029 <... semop resumed> ) = 0 <5.414772>

119969 14:33:53.491039 <... semop resumed> ) = 0 <2.275608>

119966 14:33:53.491048 <... semop resumed> ) = 0 <2.526024>

119942 14:33:53.491058 <... semop resumed> ) = 0 <5.448506>

119964 15:23:38.746394 <... semop resumed> ) = 0 <2.034851>

119960 15:23:38.746426 <... semop resumed> ) = 0 <2.038321>

119966 15:23:38.752646 <... semop resumed> ) = 0 <1.252342>

 

It was also spotted that the WAL writer Postgres process spends time in <... semop resumed> during the hangs.

 

I also have an application on the DB host that takes pg_stat_activity snapshots every 500ms, and, for example, I can see that there was no snapshot between 14:33:47 and 14:33:53.

A separate simple script on the DB host checks the ps output for this application every ~100ms and writes it to a text file. We can see that while it usually manages about 7-8 checks per second, between 14:33:47 and 14:33:53 it could not even perform enough ps calls. Strace for this backend showed that the process was stuck in a semop call. So this tells me the whole system hangs.
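
A rough reconstruction of that watcher (the real script differs in details; <snapshot_app_pid> is a placeholder):

# poll roughly every 100ms: timestamp plus process state and kernel wait
# channel (wchan) of the snapshot application; the per-second TOTAL and
# wait-channel counts below are aggregated from this kind of log
while true; do
    date '+%H:%M:%S.%N'                                            >> /tmp/ps_watch.log
    ps -p <snapshot_app_pid> -o pid,stat,wchan:32,cmd --no-headers >> /tmp/ps_watch.log
    sleep 0.1
done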

14:33:40 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:41 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:42 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:43 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:44 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:45 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:46 TOTAL=6 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:47 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:48 TOTAL=3 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:49 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:50 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:51 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:52 TOTAL=4 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:53 TOTAL=6 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:54 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

14:33:55 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0

 

I understand that RAID-6 is not the best option, but I cannot find any evidence that the system runs out of the 1GB RAID controller write cache.
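
One way to watch for that, for what it's worth, is plain per-device extended iostat sampling around the incident window (the array is /dev/sda on this host); a saturated controller write cache would typically show up as a jump in w_await and %util there:

# 1-second extended device statistics for the RAID-6 volume
iostat -x -d sda 1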

 

Please help me understand the meaning and nature of these long semop calls.

 

--

Regards,

Pavel Suderevsky

 

 

 

From: Pavel Suderevsky
Sent: Monday, January 22, 2018 21:43
To: pgsql-performance@xxxxxxxxxxxxxx
Subject: PG 9.6 Slow inserts with long-lasting LWLocks

 

Hi,

 

I have an issue with sporadic slow insert operations: the query duration exceeds 1 sec, while the average is about 50ms.

 

Configuration:

OS: CentOS 7.2.1511

PostgreSQL: 9.6.3

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Memory:           total        used        free      shared     buff/cache  available

                Mem:         193166       10324        1856      44522      180985      137444

                Swap:             0           0           0

Storage: Well, about 4GB of BBU write cache.

 

shared_buffers = 32GB

work_mem = 128MB

max_pred_locks_per_transaction = 8192

 

This can occur once a day or not happen for a few days, while the system load stays the same. The "inserts" are prepared-statement batches of 4-5 inserts.

Neither excessive memory usage nor high disk or CPU utilization has been caught.

WAL write rates, checkpoints, and everything else in the pg_stat_* views were checked, and nothing suspicious was found.

 

Several scenarios of such long inserts have been spotted:

1. No locks caught at all (500ms check intervals)

2. Wait event is "buffer_mapping" - looks like the most common case

 snapshot time               | state  | trx duration    | query duration   | wait_event_type | wait_event     | query

 2017-12-22 03:16:01.181014 | active | 00:00:00.535309 | 00:00:00.524729  | LWLockTranche   | buffer_mapping | INSERT INTO table..

 2017-12-22 03:16:00.65814  | active | 00:00:00.012435 | 00:00:00.001855  | LWLockTranche   | buffer_mapping | INSERT INTO table..

3. Wait event is "SerializablePredicateLockListLock" (I believe it has the same root cause as the previous case)

4. No locks caught, but ~39 other backends are active in parallel

 

I assumed it could somehow be related to NUMA being enabled, but it looks like memory is allocated evenly, and zone_reclaim_mode is 0.

numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46

node 0 size: 130978 MB

node 0 free: 1251 MB

node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47

node 1 size: 65536 MB

node 1 free: 42 MB

node distances:

node   0   1 

  0:  10  21 

  1:  21  10 

  

numastat -m

 

Per-node system memory usage (in MBs):

                          Node 0          Node 1           Total

                 --------------- --------------- ---------------

MemTotal               130978.34        65536.00       196514.34

MemFree                  1479.07          212.12         1691.20

MemUsed                129499.27        65323.88       194823.14

Active                  72241.16        37254.56       109495.73

Inactive                47936.24        24205.40        72141.64

Active(anon)            21162.41        18978.96        40141.37

Inactive(anon)           1061.94         7522.34         8584.27

Active(file)            51078.76        18275.60        69354.36

Inactive(file)          46874.30        16683.06        63557.36

Unevictable                 0.00            0.00            0.00

Mlocked                     0.00            0.00            0.00

Dirty                       0.04            0.02            0.05

Writeback                   0.00            0.00            0.00

FilePages              116511.36        60923.16       177434.52

Mapped                  16507.29        23912.82        40420.11

AnonPages                3661.55          530.26         4191.81

Shmem                   18558.28        25964.74        44523.02

KernelStack                16.98            5.77           22.75

PageTables               3943.56         1022.25         4965.81

NFS_Unstable                0.00            0.00            0.00

Bounce                      0.00            0.00            0.00

WritebackTmp                0.00            0.00            0.00

Slab                     2256.09         1291.53         3547.61

SReclaimable             2108.29          889.85         2998.14

SUnreclaim                147.80          401.68          549.47

AnonHugePages            1824.00          284.00         2108.00

HugePages_Total             0.00            0.00            0.00

HugePages_Free              0.00            0.00            0.00

HugePages_Surp              0.00            0.00            0.00

 

$ cat /proc/62679/numa_maps | grep N0 | grep zero

7f92509d3000 prefer:0 file=/dev/zero\040(deleted) dirty=8419116 mapmax=154 active=8193350 N0=3890534 N1=4528582 kernelpagesize_kB=4

 

Could you advise what can cause such occasional long inserts with long-lasting LWLocks?

 

 

root@dbhost [20180314 16:23:44]:/home/user$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2900.000
BogoMIPS:              4998.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

root@dbhost [20180314 17:06:47]:/home/user$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
node 0 size: 130978 MB
node 0 free: 7194 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
node 1 size: 65536 MB
node 1 free: 642 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
  
root@dbhost [20180314 16:23:46]:/home/user$ numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               130978.34        65536.00       196514.34
MemFree                  5897.14          426.93         6324.06
MemUsed                125081.20        65109.07       190190.28
Active                  44828.92        16888.00        61716.92
Inactive                42311.52        12586.01        54897.53
Active(anon)             6113.09         4818.25        10931.33
Inactive(anon)            154.44          281.88          436.32
Active(file)            38715.84        12069.75        50785.59
Inactive(file)          42157.08        12304.12        54461.21
Unevictable                 0.00            0.00            0.00
Mlocked                     0.00            0.00            0.00
Dirty                       6.55            0.30            6.85
Writeback                   0.00            0.00            0.00
FilePages               81428.38        25263.89       106692.27
Mapped                     58.56           26.13           84.69
AnonPages                5767.31         4265.87        10033.18
Shmem                     499.60          834.26         1333.86
KernelStack                17.30            5.94           23.23
PageTables                 40.75           36.96           77.71
NFS_Unstable                0.00            0.00            0.00
Bounce                      0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00
Slab                     1721.68         1056.38         2778.06
SReclaimable             1542.00          864.26         2406.26
SUnreclaim                179.68          192.12          371.80
AnonHugePages              62.00           28.00           90.00
HugePages_Total         33000.00        33000.00        66000.00
HugePages_Free           4450.00         4454.00         8904.00
HugePages_Surp              0.00            0.00            0.00


root@dbhost [20180314 16:25:36]:/home/user$ cat /etc/sysctl.conf
kernel.shmmax=64424509440
kernel.shmall=4294967296
# kernel.sem=<SEMMSL> <SEMMNS> <SEMOPM> <SEMMNI>
kernel.sem = 1024 32767 128 16384
fs.aio-max-nr=3145728
fs.file-max = 6815744
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=1048586
vm.nr_hugepages=33000
vm.dirty_background_bytes=67108864
vm.dirty_bytes=536870912
vm.min_free_kbytes=1048576


root@dbhost [20180314 16:25:49]:/home/user$ free -m
              total        used        free      shared  buff/cache   available
Mem:         193166       77128        6139        1333      109899      110458
Swap:             0           0           0


root@dbhost [20180314 16:28:07]:/home/user$ omreport storage controller
 Controller  PERC H730 Mini(Embedded)

Controller
ID                                            : 0
Status                                        : Ok
Name                                          : PERC H730 Mini
Slot ID                                       : Embedded
State                                         : Ready
Firmware Version                              : 25.5.0.0018
Minimum Required Firmware Version             : Not Applicable
Driver Version                                : 06.807.10.00-rh1
Minimum Required Driver Version               : Not Applicable
Storport Driver Version                       : Not Applicable
Minimum Required Storport Driver Version      : Not Applicable
Number of Connectors                          : 2
Rebuild Rate                                  : 30%
BGI Rate                                      : 30%
Check Consistency Rate                        : 30%
Reconstruct Rate                              : 30%
Alarm State                                   : Not Applicable
Cluster Mode                                  : Not Applicable
SCSI Initiator ID                             : Not Applicable
Cache Memory Size                             : 1024 MB
Patrol Read Mode                              : Auto
Patrol Read State                             : Active
Patrol Read Rate                              : 30%
Patrol Read Iterations                        : 117
Abort Check Consistency on Error              : Disabled
Allow Revertible Hot Spare and Replace Member : Enabled
Load Balance                                  : Not Applicable
Auto Replace Member on Predictive Failure     : Disabled
Redundant Path view                           : Not Applicable
CacheCade Capable                             : Not Applicable
Persistent Hot Spare                          : Disabled
Encryption Capable                            : Yes
Encryption Key Present                        : No
Encryption Mode                               : None
Preserved Cache                               : Not Applicable
Spin Down Unconfigured Drives                 : Disabled
Spin Down Hot Spares                          : Disabled
Spin Down Configured Drives                   : Disabled
Automatic Disk Power Saving (Idle C)          : Disabled
Start Time (HH:MM)                            : Not Applicable
Time Interval for Spin Up (in Hours)          : Not Applicable
T10 Protection Information Capable            : No
Non-RAID HDD Disk Cache Policy                : Unchanged
Current Controller Mode                       : RAID


root@dbhost [20180314 16:39:01]:/home/user$ omreport storage vdisk controller=0
Virtual Disk 0 on Controller PERC H730 Mini (Embedded)

Controller PERC H730 Mini (Embedded)
ID                                : 0
Status                            : Ok
Name                              : GenericR6_0
State                             : Ready
Hot Spare Policy violated         : Not Assigned
Encrypted                         : No
Layout                            : RAID-6
Size                              : 6,703.50 GB (7197828317184 bytes)
T10 Protection Information Status : No
Associated Fluid Cache State      : Not Applicable
Device Name                       : /dev/sda
Bus Protocol                      : SAS
Media                             : HDD
Read Policy                       : No Read Ahead
Write Policy                      : Write Back
Cache Policy                      : Not Applicable
Stripe Element Size               : 64 KB
Disk Cache Policy                 : Unchanged

root@dbhost [20180314 16:31:29]:/home/user$ grep ^Huge /proc/meminfo
HugePages_Total:   33000
HugePages_Free:     4452
HugePages_Rsvd:     2066
HugePages_Surp:        0
Hugepagesize:       2048 kB



=====================
postgresql.conf

max_connections = 2048

shared_buffers = 48GB
temp_buffers = 128MB
work_mem = 256MB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix

wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 2000
max_replication_slots = 5

min_wal_size = 4GB
max_wal_size = 32GB

archive_mode = on

autovacuum_vacuum_threshold  = 100
autovacuum_analyze_threshold = 100

autovacuum_vacuum_scale_factor  = 0.02
autovacuum_analyze_scale_factor = 0.01

autovacuum_vacuum_cost_delay = 14
autovacuum_vacuum_cost_limit = 5000
autovacuum_max_workers = 4
