Re: PG 9.6 Slow inserts with long-lasting LWLocks

MichaelDBA <MichaelDBA@xxxxxxxxxxx> · Fri, 16 Mar 2018 09:42:34 -0400

Sporadic
 insert slowness could be due to lock delays (locktype=extend) due to 
many concurrent connections trying to insert into the same table at the 
same time. Each insert request may result in an extend lock (8k 
extension), which blocks other writers. What normally happens is the 
these extend locks happen so fast that you hardly ever see them in the 
pg_locks table, except in the case where many concurrent connections are
 trying to do inserts into the same table. The following query will show
 if this is the case if you execute it during the time the problem is 
occurring.

select * from pg_locks where granted = false and locktype = 'extend';

I don't know if this is your particular problem, but perhaps it 
is.

Regards,

Michael Vitale

   	Pavel Suderevsky
        Thursday,
 March 15, 2018 6:29 AM

  Hi, 

Well, 
unfortunately I still need community help.

-- Environment
OS: Centos 
CentOS Linux release 7.2.1511
Kernel:  
3.10.0-327.36.3.el7.x86_64
PostgreSQL: 9.6.3
-- Hardware
Server: Dell 
PowerEdge R430
CPU: Intel(R) Xeon(R) CPU E5-2680
 v3 @ 2.50GHz
Raid controller: PERC H730 Mini 
(1GB cache)
Disks: 8 x 10K RPM SAS 12GB/s 2.5 
(ST1200MM0088) in RAID 6
RAM: 192GB 
(M393A2G40DB0-CPB x 16)
For more detailed 
hardware info please see attached configuration.txt
-- postgresql.conf
max_connections
 = 2048
shared_buffers = 48GB
temp_buffers = 128MB
work_mem =
 256MB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
wal_level = hot_standby
min_wal_size
 = 4GB
max_wal_size = 32GB
huge_pages = on
+
numactl interleave=all
-- 
sysctl.conf 
kernel.shmmax=64424509440
kernel.shmall=4294967296
kernel.sem
 = 1024 32767 128 16384
fs.aio-max-nr=3145728
fs.file-max = 6815744
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=1048586
vm.nr_hugepages=33000
vm.dirty_background_bytes=67108864
vm.dirty_bytes=536870912
vm.min_free_kbytes=1048576
zone_reclaim_mode=0

Again: problem is the occasional long inserts that 
can happen 1-5 times per day on OLTP system.
No 
autovacuum performed during long inserts. WAL rate is 1-2Gb per hour, no
 correlation spotted with this issue.
Wait event
 "buffer_mapping" happen for appropriate transactions but not every time
 (maybe just not every time catched).
I have two
 suspects for such behaviour: I/O system and high concurrency.
There is a problem with one application that 
frequently recreates up to 90 sessions but investigation shows that 
there is no direct correlation between such sessions and long 
transactions, at least it is not the root cause of the issue (of course 
such app behaviour will be fixed).

The investigation and tracing with strace in 
particular showed that:
1. The only long event 
straced from postgres backends was <... semop resumed>.
2. Seems the whole host gets hung during such events. 

Example:
Java application located on separate host reports 
several long transactions:
123336.943 - 
[1239588mks]: event.insert-table
123336.943 - 
[1240827mks]: event.insert-table
123337.019 - 
[1292534mks]: event.insert-table
143353.542 - 
[5467657mks]: event.insert-table
143353.543 - 
[5468884mks]: event.insert-table
152338.763 - 
[1264588mks]: event.insert-table
152338.765 - 
[2054887mks]: event.insert-table

Strace output for event happened at 14:33 with 
particular known pid:
119971 14:33:48.075375 
epoll_wait(3,  <unfinished ...>
119971 
14:33:48.075696 <... epoll_wait resumed> {{EPOLLIN, {u32=27532016,
 u64=27532016}}}, 1, -1) = 1 <0.000313>
119971
 14:33:48.075792 recvfrom(9,  <unfinished ...>
119971 14:33:48.075866 <... recvfrom resumed> 
"B\0\0\3\27\0S_21\0\0*\0\1\0\1\0\1\0\0\0\0\0\1\0\1\0\0\0\0\0"..., 8192, 
0, NULL, NULL) = 807 <0.000066>
119971 
14:33:48.076243 semop(26706044, {{8, -1, 0}}, 1 <unfinished ...>
120019 14:33:48.119971 recvfrom(9,  <unfinished 
...>
119971 14:33:53.491029 <... semop 
resumed> ) = 0 <5.414772> 
119971 
14:33:53.500356 lseek(18, 0, SEEK_END <unfinished ...>
119971 14:33:53.500436 <... lseek resumed> ) = 
107790336 <0.000072>
119971 
14:33:53.500514 lseek(20, 0, SEEK_END <unfinished ...>

Checking strace 
long semop calls for whole day:
root@host 
[20180314 17:47:36]:/home/user$ egrep " <[1-9]." /tmp/strace | grep 
semop
119991 12:33:36 <... semop resumed> 
)   = 0 <1.419394>
119942 12:33:36 <...
 semop resumed> )   = 0 <1.422554>
119930
 12:33:36 <... semop resumed> )   = 0 <1.414916>
119988 12:33:36 <... semop resumed> )   = 0 
<1.213309>
119966 12:33:36 <... semop 
resumed> )   = 0 <1.237492>
119958 
14:33:53.489398 <... semop resumed> ) = 0 <5.455830>
120019 14:33:53.490613 <... semop resumed> ) = 0
 <5.284505>
119997 14:33:53.490638 <...
 semop resumed> ) = 0 <5.111661>
120000
 14:33:53.490649 <... semop resumed> ) = 0 <3.521992>
119991 14:33:53.490660 <... semop resumed> ) = 0
 <2.522460>
119988 14:33:53.490670 <...
 semop resumed> ) = 0 <5.252485>
120044
 14:33:53.490834 <... semop resumed> ) = 0 <1.718129>
119976 14:33:53.490852 <... semop resumed> ) = 0
 <2.489563>
119974 14:33:53.490862 <...
 semop resumed> ) = 0 <1.520801>
119984
 14:33:53.491011 <... semop resumed> ) = 0 <1.213411>
119971 14:33:53.491029 <... semop resumed> ) = 0
 <5.414772>
119969 14:33:53.491039 <...
 semop resumed> ) = 0 <2.275608>
119966
 14:33:53.491048 <... semop resumed> ) = 0 <2.526024>
119942 14:33:53.491058 <... semop resumed> ) = 0
 <5.448506>
119964 15:23:38.746394 <...
 semop resumed> ) = 0 <2.034851>
119960
 15:23:38.746426 <... semop resumed> ) = 0 <2.038321>
119966 15:23:38.752646 <... semop resumed> ) = 0
 <1.252342>

Also it was spotted that WALWriter Postgres backend 
also spend time in <semop resumed> during hangs.

Also I have 
application on db host that performs pg_stat_activity shapshots every 
500m and for example I can see that there were no snapshot between 
14:33:47 and 14:33:53.
Separate simple script on
 db host every ~100ms checks ps output for this application and writes 
it into the txt file. And we can see that while it usually performs 
about 7-8 times per second, between 14:33:47 and 14:33:53 it couldn't 
even perform enough ps calls. Strace for this backend showed that this 
process was hung in semop call. So it tells me that whole system gets 
hung.
14:33:40 TOTAL=7 wait_transaction_locked=0
 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:41
 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:42 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:43 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:44 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:45 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:46 TOTAL=6 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:47 TOTAL=2 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:48 TOTAL=3 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:49 TOTAL=2 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:50 TOTAL=2 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:51 TOTAL=2 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:52 TOTAL=4 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:53 TOTAL=6 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:54 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0
14:33:55 TOTAL=7 
wait_transaction_locked=0 sleep_on_page=0 get_request=0 
sleep_on_buffer=0

I understand that RAID-6 is not the best option, but I
 can't catch any evidence telling that system run out of 1GB RAID 
controller cache on writes.

Please assist in understanding meaning and nature of 
long semop calls appearances.

--
Regards,
Pavel Suderevsky

From: Pavel Suderevsky
Sent: Monday,
 January 22, 2018 21:43
To: pgsql-performance@xxxxxxxxxxxxxx
Subject:
 PG 9.6 Slow inserts with long-lasting LWLocks

Hi,

I 
have an issue with sporadic slow insert operations with query duration 
more than 1 sec while it takes about 50ms in average.

Configuration:
OS: Centos 7.2.151
PostgreSQL:
 9.6.3
CPU: Intel(R) Xeon(R) CPU 
E5-2680 v3 @ 2.50GHz
Memory:         
  total        used        free      shared     buff/cache  available
                Mem:         193166       10324       
 1856      44522      180985      137444
                Swap:             0           0       
    0
Storage: Well,  about 4gb of 
BBU write cache.

shared_buffers = 32gb
work_mem = 128mb
max_pred_locks_per_transaction
 = 8192

This can occur once a day or not happen for few days 
while system load is the same. "Inserts" are the prepared statement 
batches with 4-5 inserts.
Neither 
excessive memory usage nor disk or cpu utilizations have been catched.
Wal writing rates, checkpoints, anything else from 
pg_stat_* tables were checked and nothing embarrassing was found.

There
 are several scenarious of such long inserts were spotted:
1. No any locks catched (500ms check intervals)
2. Wait event is "buffer_mapping" - looks like the 
most common case
 snaphot 
time                                                    | state   | trx 
duration    | query duration   | wait_event_type | wait_event     | 
query
 2017-12-22 03:16:01.181014 | 
active | 00:00:00.535309 | 00:00:00.524729  | LWLockTranche   | 
buffer_mapping | INSERT INTO table..
 2017-12-22
 03:16:00.65814  | active | 00:00:00.012435 | 00:00:00.001855  | 
LWLockTranche   | buffer_mapping | INSERT INTO table..
3. Wait event is "SerializablePredicateLockListLock" 
(I believe the same root cause as previous case)
4. No any locks catched, but ~39 other backends in 
parallel are active 

I assumed that it can be somehow related to enabled 
NUMA, but it looks like memory is allocated evenly, zone_reclaim_mode is
 0.
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 
30 32 34 36 38 40 42 44 46
node 0 
size: 130978 MB
node 0 free: 1251 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 
31 33 35 37 39 41 43 45 47
node 1 
size: 65536 MB
node 1 free: 42 MB
node distances:
node 
  0   1 
  0:  10  21 
  1:  21  10 

numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1     
      Total

 --------------- --------------- ---------------
MemTotal               130978.34        65536.00      
 196514.34
MemFree                  
1479.07          212.12         1691.20
MemUsed                129499.27        65323.88      
 194823.14
Active                  
72241.16        37254.56       109495.73
Inactive                47936.24        24205.40       
 72141.64
Active(anon)            
21162.41        18978.96        40141.37
Inactive(anon)           1061.94         7522.34       
  8584.27
Active(file)            
51078.76        18275.60        69354.36
Inactive(file)          46874.30        16683.06       
 63557.36
Unevictable                
 0.00            0.00            0.00
Mlocked 
                    0.00            0.00            0.00
Dirty                       0.04            0.02     
       0.05
Writeback                 
  0.00            0.00            0.00
FilePages              116511.36        60923.16      
 177434.52
Mapped                  
16507.29        23912.82        40420.11
AnonPages                3661.55          530.26       
  4191.81
Shmem                  
 18558.28        25964.74        44523.02
KernelStack                16.98            5.77       
    22.75
PageTables              
 3943.56         1022.25         4965.81
NFS_Unstable                0.00            0.00       
     0.00
Bounce                     
 0.00            0.00            0.00
WritebackTmp 
               0.00            0.00            0.00
Slab                     2256.09         1291.53       
  3547.61
SReclaimable            
 2108.29          889.85         2998.14
SUnreclaim                147.80          401.68       
   549.47
AnonHugePages            
1824.00          284.00         2108.00
HugePages_Total             0.00            0.00       
     0.00
HugePages_Free             
 0.00            0.00            0.00
HugePages_Surp 
             0.00            0.00            0.00

$ cat 
/proc/62679/numa_maps | grep N0 | grep zero
7f92509d3000 prefer:0 file=/dev/zero\040(deleted) 
dirty=8419116 mapmax=154 active=8193350 N0=3890534 N1=4528582 
kernelpagesize_kB=4

Could you advise what can cause such occasional long 
inserts with long-lasting LWlocks?