Hi,

Well, unfortunately I still need community help.

-- Environment
OS: CentOS Linux release 7.2.1511
Kernel: 3.10.0-327.36.3.el7.x86_64
PostgreSQL: 9.6.3

-- Hardware
Server: Dell PowerEdge R430
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
RAID controller: PERC H730 Mini (1GB cache)
Disks: 8 x 10K RPM SAS 12Gb/s 2.5" (ST1200MM0088) in RAID 6
RAM: 192GB (M393A2G40DB0-CPB x 16)
For more detailed hardware info please see the attached configuration.txt.

-- postgresql.conf
max_connections = 2048
shared_buffers = 48GB
temp_buffers = 128MB
work_mem = 256MB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
wal_level = hot_standby
min_wal_size = 4GB
max_wal_size = 32GB
huge_pages = on

+ numactl interleave=all

-- sysctl.conf
kernel.shmmax=64424509440
kernel.shmall=4294967296
kernel.sem = 1024 32767 128 16384
fs.aio-max-nr=3145728
fs.file-max = 6815744
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=1048586
vm.nr_hugepages=33000
vm.dirty_background_bytes=67108864
vm.dirty_bytes=536870912
vm.min_free_kbytes=1048576
zone_reclaim_mode=0

To recap: the problem is occasional long inserts that can happen 1-5 times per day on an OLTP system. No autovacuum is running during the long inserts. The WAL rate is 1-2GB per hour, and no correlation with this issue has been spotted. The wait event "buffer_mapping" shows up for the affected transactions, but not every time (possibly it is simply not caught every time).

I have two suspects for this behaviour: the I/O system and high concurrency. There is a problem with one application that frequently recreates up to 90 sessions, but investigation shows no direct correlation between those sessions and the long transactions; at least it is not the root cause of the issue (of course that application behaviour will be fixed).

The investigation, and tracing with strace in particular, showed that:
1. The only long event straced from the postgres backends was <... semop resumed>.
2. The whole host seems to hang during such events.

Example: a Java application located on a separate host reports several long transactions:

123336.943 - [1239588mks]: event.insert-table
123336.943 - [1240827mks]: event.insert-table
123337.019 - [1292534mks]: event.insert-table
143353.542 - [5467657mks]: event.insert-table
143353.543 - [5468884mks]: event.insert-table
152338.763 - [1264588mks]: event.insert-table
152338.765 - [2054887mks]: event.insert-table

Strace output for the event that happened at 14:33, for a particular known pid:

119971 14:33:48.075375 epoll_wait(3, <unfinished ...>
119971 14:33:48.075696 <... epoll_wait resumed> {{EPOLLIN, {u32=27532016, u64=27532016}}}, 1, -1) = 1 <0.000313>
119971 14:33:48.075792 recvfrom(9, <unfinished ...>
119971 14:33:48.075866 <... recvfrom resumed> "B\0\0\3\27\0S_21\0\0*\0\1\0\1\0\1\0\0\0\0\0\1\0\1\0\0\0\0\0"..., 8192, 0, NULL, NULL) = 807 <0.000066>
119971 14:33:48.076243 semop(26706044, {{8, -1, 0}}, 1 <unfinished ...>
120019 14:33:48.119971 recvfrom(9, <unfinished ...>
119971 14:33:53.491029 <... semop resumed> ) = 0 <5.414772>
119971 14:33:53.500356 lseek(18, 0, SEEK_END <unfinished ...>
119971 14:33:53.500436 <... lseek resumed> ) = 107790336 <0.000072>
119971 14:33:53.500514 lseek(20, 0, SEEK_END <unfinished ...>
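For reference, output in this format can be captured with an strace invocation along these lines (a sketch only, not necessarily the exact command that was used; the data directory path is a placeholder):

    # -f follows every backend forked by the postmaster, -tt prints wall-clock
    # timestamps, -T appends the time spent in each syscall (the <...> at line end).
    PGDATA=/path/to/pgdata                               # placeholder
    POSTMASTER_PID=$(head -1 "$PGDATA/postmaster.pid")
    strace -f -tt -T -o /tmp/strace -p "$POSTMASTER_PID"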
Checking strace for long semop calls over the whole day:

root@host [20180314 17:47:36]:/home/user$ egrep " <[1-9]." /tmp/strace | grep semop
119991 12:33:36 <... semop resumed> ) = 0 <1.419394>
119942 12:33:36 <... semop resumed> ) = 0 <1.422554>
119930 12:33:36 <... semop resumed> ) = 0 <1.414916>
119988 12:33:36 <... semop resumed> ) = 0 <1.213309>
119966 12:33:36 <... semop resumed> ) = 0 <1.237492>
119958 14:33:53.489398 <... semop resumed> ) = 0 <5.455830>
120019 14:33:53.490613 <... semop resumed> ) = 0 <5.284505>
119997 14:33:53.490638 <... semop resumed> ) = 0 <5.111661>
120000 14:33:53.490649 <... semop resumed> ) = 0 <3.521992>
119991 14:33:53.490660 <... semop resumed> ) = 0 <2.522460>
119988 14:33:53.490670 <... semop resumed> ) = 0 <5.252485>
120044 14:33:53.490834 <... semop resumed> ) = 0 <1.718129>
119976 14:33:53.490852 <... semop resumed> ) = 0 <2.489563>
119974 14:33:53.490862 <... semop resumed> ) = 0 <1.520801>
119984 14:33:53.491011 <... semop resumed> ) = 0 <1.213411>
119971 14:33:53.491029 <... semop resumed> ) = 0 <5.414772>
119969 14:33:53.491039 <... semop resumed> ) = 0 <2.275608>
119966 14:33:53.491048 <... semop resumed> ) = 0 <2.526024>
119942 14:33:53.491058 <... semop resumed> ) = 0 <5.448506>
119964 15:23:38.746394 <... semop resumed> ) = 0 <2.034851>
119960 15:23:38.746426 <... semop resumed> ) = 0 <2.038321>
119966 15:23:38.752646 <... semop resumed> ) = 0 <1.252342>

It was also spotted that the WALWriter process spends time in <... semop resumed> during the hangs.

I also have an application on the db host that takes pg_stat_activity snapshots every 500 ms, and, for example, I can see that there is no snapshot between 14:33:47 and 14:33:53. A separate simple script on the db host checks the ps output for this application roughly every 100 ms and writes it into a text file (a rough sketch of such a sampler is shown after the output below). While it normally manages about 7-8 ps calls per second, between 14:33:47 and 14:33:53 it could not even complete that many ps calls. Strace for that backend showed the process was also stuck in a semop call. So it tells me the whole system hangs.

14:33:40 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:41 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:42 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:43 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:44 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:45 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:46 TOTAL=6 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:47 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:48 TOTAL=3 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:49 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:50 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:51 TOTAL=2 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:52 TOTAL=4 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:53 TOTAL=6 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:54 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
14:33:55 TOTAL=7 wait_transaction_locked=0 sleep_on_page=0 get_request=0 sleep_on_buffer=0
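A rough sketch of what that sampler does (an illustration only, assuming it records the monitored application's kernel wait channel via ps; the real script that produces the per-second TOTAL/wchan counters above may differ):

    # Hypothetical ~100 ms sampler: log timestamp + wait channel (wchan) of the
    # monitored pid; a second pass can then count samples per second (TOTAL=)
    # and how often the process sat in wchans such as sleep_on_page or get_request.
    APP_PID=12345                                        # placeholder pid
    while true; do
        echo "$(date +%H:%M:%S.%N) $(ps -o wchan:32= -p "$APP_PID")" >> /tmp/ps_wchan.txt
        sleep 0.1
    done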
I understand that RAID-6 is not the best option, but I can't catch any evidence that the system runs out of the 1GB RAID controller cache on writes.

Please assist in understanding the meaning and nature of these long semop calls.

--
Regards,
Pavel Suderevsky


From: Pavel Suderevsky

Hi,

I have an issue with sporadic slow insert operations, with a query duration of more than 1 sec while it takes about 50ms on average.

Configuration:
OS: CentOS 7.2.1511
PostgreSQL: 9.6.3
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory:
              total        used        free      shared  buff/cache   available
Mem:         193166       10324        1856       44522      180985      137444
Swap:             0           0           0
Storage: well, about 4GB of BBU write cache.

shared_buffers = 32GB
work_mem = 128MB
max_pred_locks_per_transaction = 8192

This can occur once a day or not happen for a few days, while the system load stays the same. The "inserts" are prepared-statement batches of 4-5 inserts each. Neither excessive memory usage nor disk or CPU utilization has been caught. WAL write rates, checkpoints, and everything else from the pg_stat_* views were checked and nothing suspicious was found.

Several scenarios of such long inserts have been spotted:

1. No locks caught at all (500ms check intervals).
2. Wait event is "buffer_mapping" - this looks like the most common case:

snapshot time              | state  | trx duration    | query duration  | wait_event_type | wait_event     | query
2017-12-22 03:16:01.181014 | active | 00:00:00.535309 | 00:00:00.524729 | LWLockTranche   | buffer_mapping | INSERT INTO table..
2017-12-22 03:16:00.65814  | active | 00:00:00.012435 | 00:00:00.001855 | LWLockTranche   | buffer_mapping | INSERT INTO table..

3. Wait event is "SerializablePredicateLockListLock" (I believe the same root cause as the previous case).
4. No locks caught, but ~39 other backends are active in parallel.

I assumed it could somehow be related to NUMA being enabled, but it looks like memory is allocated evenly, and zone_reclaim_mode is 0.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
node 0 size: 130978 MB
node 0 free: 1251 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
node 1 size: 65536 MB
node 1 free: 42 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               130978.34        65536.00       196514.34
MemFree                  1479.07          212.12         1691.20
MemUsed                129499.27        65323.88       194823.14
Active                  72241.16        37254.56       109495.73
Inactive                47936.24        24205.40        72141.64
Active(anon)            21162.41        18978.96        40141.37
Inactive(anon)           1061.94         7522.34         8584.27
Active(file)            51078.76        18275.60        69354.36
Inactive(file)          46874.30        16683.06        63557.36
Unevictable                 0.00            0.00            0.00
Mlocked                     0.00            0.00            0.00
Dirty                       0.04            0.02            0.05
Writeback                   0.00            0.00            0.00
FilePages              116511.36        60923.16       177434.52
Mapped                  16507.29        23912.82        40420.11
AnonPages                3661.55          530.26         4191.81
Shmem                   18558.28        25964.74        44523.02
KernelStack                16.98            5.77           22.75
PageTables               3943.56         1022.25         4965.81
NFS_Unstable                0.00            0.00            0.00
Bounce                      0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00
Slab                     2256.09         1291.53         3547.61
SReclaimable             2108.29          889.85         2998.14
SUnreclaim                147.80          401.68          549.47
AnonHugePages            1824.00          284.00         2108.00
HugePages_Total             0.00            0.00            0.00
HugePages_Free              0.00            0.00            0.00
HugePages_Surp              0.00            0.00            0.00

$ cat /proc/62679/numa_maps | grep N0 | grep zero
7f92509d3000 prefer:0 file=/dev/zero\040(deleted) dirty=8419116 mapmax=154 active=8193350 N0=3890534 N1=4528582 kernelpagesize_kB=4

Could you advise what can cause such occasional long inserts with long-lasting LWLocks?
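As background for the 500 ms pg_stat_activity snapshots mentioned earlier (and the sample rows in the quoted message), a minimal sketch of such a polling query, using only the stock 9.6 pg_stat_activity columns, could look like the following. This is an illustration, not necessarily the exact query the snapshot tool runs:

    # Poll active sessions; xact/query durations are derived from xact_start
    # and query_start, matching the columns shown in the snapshot sample above.
    psql -X -c "
      SELECT now()               AS snapshot_time,
             state,
             now() - xact_start  AS trx_duration,
             now() - query_start AS query_duration,
             wait_event_type,
             wait_event,
             query
        FROM pg_stat_activity
       WHERE state <> 'idle';"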
root@dbhost [20180314 16:23:44]:/home/user$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Stepping:              2
CPU MHz:               2900.000
BogoMIPS:              4998.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

root@dbhost [20180314 17:06:47]:/home/user$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
node 0 size: 130978 MB
node 0 free: 7194 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
node 1 size: 65536 MB
node 1 free: 642 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

root@dbhost [20180314 16:23:46]:/home/user$ numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               130978.34        65536.00       196514.34
MemFree                  5897.14          426.93         6324.06
MemUsed                125081.20        65109.07       190190.28
Active                  44828.92        16888.00        61716.92
Inactive                42311.52        12586.01        54897.53
Active(anon)             6113.09         4818.25        10931.33
Inactive(anon)            154.44          281.88          436.32
Active(file)            38715.84        12069.75        50785.59
Inactive(file)          42157.08        12304.12        54461.21
Unevictable                 0.00            0.00            0.00
Mlocked                     0.00            0.00            0.00
Dirty                       6.55            0.30            6.85
Writeback                   0.00            0.00            0.00
FilePages               81428.38        25263.89       106692.27
Mapped                     58.56           26.13           84.69
AnonPages                5767.31         4265.87        10033.18
Shmem                     499.60          834.26         1333.86
KernelStack                17.30            5.94           23.23
PageTables                 40.75           36.96           77.71
NFS_Unstable                0.00            0.00            0.00
Bounce                      0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00
Slab                     1721.68         1056.38         2778.06
SReclaimable             1542.00          864.26         2406.26
SUnreclaim                179.68          192.12          371.80
AnonHugePages              62.00           28.00           90.00
HugePages_Total         33000.00        33000.00        66000.00
HugePages_Free           4450.00         4454.00         8904.00
HugePages_Surp              0.00            0.00            0.00

root@dbhost [20180314 16:25:36]:/home/user$ cat /etc/sysctl.conf
kernel.shmmax=64424509440
kernel.shmall=4294967296
# kernel.sem=<SEMMSL> <SEMMNS> <SEMOPM> <SEMMNI>
kernel.sem = 1024 32767 128 16384
fs.aio-max-nr=3145728
fs.file-max = 6815744
net.core.rmem_default=262144
net.core.rmem_max=4194304
net.core.wmem_default=262144
net.core.wmem_max=1048586
vm.nr_hugepages=33000
vm.dirty_background_bytes=67108864
vm.dirty_bytes=536870912
vm.min_free_kbytes=1048576

root@dbhost [20180314 16:25:49]:/home/user$ free -m
              total        used        free      shared  buff/cache   available
Mem:         193166       77128        6139        1333      109899      110458
Swap:             0           0           0

root@dbhost [20180314 16:28:07]:/home/user$ omreport storage controller
Controller  PERC H730 Mini(Embedded)

Controller
ID : 0
Status : Ok
Name : PERC H730 Mini
Slot ID : Embedded
State : Ready
Firmware Version : 25.5.0.0018
Minimum Required Firmware Version : Not Applicable
Driver Version : 06.807.10.00-rh1
Minimum Required Driver Version : Not Applicable
Storport Driver Version : Not Applicable
Minimum Required Storport Driver Version : Not Applicable
Number of Connectors : 2
Rebuild Rate : 30%
BGI Rate : 30%
Check Consistency Rate : 30%
Reconstruct Rate : 30%
Alarm State : Not Applicable
Cluster Mode : Not Applicable
SCSI Initiator ID : Not Applicable
Cache Memory Size : 1024 MB
Patrol Read Mode : Auto
Patrol Read State : Active
Patrol Read Rate : 30%
Patrol Read Iterations : 117
Abort Check Consistency on Error : Disabled
Allow Revertible Hot Spare and Replace Member : Enabled
Load Balance : Not Applicable
Auto Replace Member on Predictive Failure : Disabled
Redundant Path view : Not Applicable
CacheCade Capable : Not Applicable
Persistent Hot Spare : Disabled
Encryption Capable : Yes
Encryption Key Present : No
Encryption Mode : None
Preserved Cache : Not Applicable
Spin Down Unconfigured Drives : Disabled
Spin Down Hot Spares : Disabled
Spin Down Configured Drives : Disabled
Automatic Disk Power Saving (Idle C) : Disabled
Start Time (HH:MM) : Not Applicable
Time Interval for Spin Up (in Hours) : Not Applicable
T10 Protection Information Capable : No
Non-RAID HDD Disk Cache Policy : Unchanged
Current Controller Mode : RAID

root@dbhost [20180314 16:39:01]:/home/user$ omreport storage vdisk controller=0
Virtual Disk 0 on Controller PERC H730 Mini (Embedded)

Controller PERC H730 Mini (Embedded)
ID : 0
Status : Ok
Name : GenericR6_0
State : Ready
Hot Spare Policy violated : Not Assigned
Encrypted : No
Layout : RAID-6
Size : 6,703.50 GB (7197828317184 bytes)
T10 Protection Information Status : No
Associated Fluid Cache State : Not Applicable
Device Name : /dev/sda
Bus Protocol : SAS
Media : HDD
Read Policy : No Read Ahead
Write Policy : Write Back
Cache Policy : Not Applicable
Stripe Element Size : 64 KB
Disk Cache Policy : Unchanged

root@dbhost [20180314 16:31:29]:/home/user$ grep ^Huge /proc/meminfo
HugePages_Total:   33000
HugePages_Free:     4452
HugePages_Rsvd:     2066
HugePages_Surp:        0
Hugepagesize:       2048 kB

=====================
postgresql.conf
max_connections = 2048
shared_buffers = 48GB
temp_buffers = 128MB
work_mem = 256MB
maintenance_work_mem = 512MB
dynamic_shared_memory_type = posix
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 2000
max_replication_slots = 5
min_wal_size = 4GB
max_wal_size = 32GB
archive_mode = on
autovacuum_vacuum_threshold = 100
autovacuum_analyze_threshold = 100
autovacuum_vacuum_scale_factor = 0.02
autovacuum_analyze_scale_factor = 0.01
autovacuum_vacuum_cost_delay = 14
autovacuum_vacuum_cost_limit = 5000
autovacuum_max_workers = 4
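For completeness, the huge page reservation relative to shared_buffers is just arithmetic over the values shown above (2 MB pages on this kernel):

    # shared_buffers = 48GB needs at least 48*1024/2 = 24576 huge pages;
    # vm.nr_hugepages = 33000 reserves 33000 * 2MB = 66000 MB (~64.5GB), and the
    # HugePages_Free value above shows the whole shared memory segment currently
    # fits inside that reservation.
    echo "$((48 * 1024 / 2)) pages minimum; $((33000 * 2)) MB reserved"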