05.03.2015 11:25, Jim Nasby пишет:
On 2/27/15 5:11 AM, Sergey Shchukin wrote: Hi Jim, max_standby_streaming_delay and max_standby_archive_delay both are 30s on master and replica dbs I don't see any specific or heavy workload during this issue with a hanging apply process. Just a normal queries as usual. But I see an increased disk activity during the time when the apply issue is ongoing DSK | sdc | | busy 61% | read 11511 | | write 4534 | KiB/r 46 | | KiB/w 4 | MBr/s 52.78 | | MBw/s 1.88 | avq 1.45 | | avio 0.38 ms | DSK | sde | | busy 60% | read 11457 | | write 4398 | KiB/r 46 | | KiB/w 4 | MBr/s 51.97 | | MBw/s 1.83 | avq 1.47 | | avio 0.38 ms | DSK | sdd | | busy 60% | read 9673 | | write 4538 | KiB/r 61 | | KiB/w 4 | MBr/s 58.24 | | MBw/s 1.88 | avq 1.47 | | avio 0.42 ms | DSK | sdj | | busy 59% | read 9576 | | write 4177 | KiB/r 63 | | KiB/w 4 | MBr/s 59.30 | | MBw/s 1.75 | avq 1.48 | | avio 0.43 ms | DSK | sdh | | busy 59% | read 9615 | | write 4305 | KiB/r 63 | | KiB/w 4 | MBr/s 59.23 | | MBw/s 1.80 | avq 1.48 | | avio 0.42 ms | DSK | sdf | | busy 59% | read 9483 | | write 4404 | KiB/r 63 | | KiB/w 4 | MBr/s 59.11 | | MBw/s 1.83 | avq 1.47 | | avio 0.42 ms | DSK | sdi | | busy 59% | read 11273 | | write 4173 | KiB/r 46 | | KiB/w 4 | MBr/s 51.50 | | MBw/s 1.75 | avq 1.43 | | avio 0.38 ms | DSK | sdg | | busy 59% | read 11406 | | write 4297 | KiB/r 46 | | KiB/w 4 | MBr/s 51.66 | | MBw/s 1.80 | avq 1.46 | | avio 0.37 ms | Although it's not seems to be an upper IO limit. Normally disks are busy at 20-45% DSK | sde | | busy 29% | read 6524 | | write 14426 | KiB/r 26 | | KiB/w 5 | MBr/s 17.08 | | MBw/s 7.78 | avq 10.46 | | avio 0.14 ms | DSK | sdi | | busy 29% | read 6590 | | write 14391 | KiB/r 26 | | KiB/w 5 | MBr/s 17.19 | | MBw/s 7.76 | avq 8.75 | | avio 0.14 ms | DSK | sdg | | busy 29% | read 6547 | | write 14401 | KiB/r 26 | | KiB/w 5 | MBr/s 16.94 | | MBw/s 7.60 | avq 7.28 | | avio 0.14 ms | DSK | sdc | | busy 29% | read 6835 | | write 14283 | KiB/r 27 | | KiB/w 5 | MBr/s 18.08 | | MBw/s 7.74 | avq 8.77 | | avio 0.14 ms | DSK | sdf | | busy 23% | read 3808 | | write 14391 | KiB/r 36 | | KiB/w 5 | MBr/s 13.49 | | MBw/s 7.78 | avq 12.88 | | avio 0.13 ms | DSK | sdd | | busy 23% | read 3747 | | write 14229 | KiB/r 33 | | KiB/w 5 | MBr/s 12.32 | | MBw/s 7.74 | avq 10.07 | | avio 0.13 ms | DSK | sdj | | busy 23% | read 3737 | | write 14336 | KiB/r 36 | | KiB/w 5 | MBr/s 13.16 | | MBw/s 7.76 | avq 10.48 | | avio 0.13 ms | DSK | sdh | | busy 23% | read 3793 | | write 14362 | KiB/r 35 | | KiB/w 5 | MBr/s 13.29 | | MBw/s 7.60 | avq 8.61 | | avio 0.13 ms | Also during the issue perf shows [k] copy_user_generic_string on the top positions 14.09% postmaster postgres [.] 0x00000000001b4569 10.25% postmaster [kernel.kallsyms] [k] copy_user_generic_string 4.15% postmaster postgres [.] hash_search_with_hash_value 2.08% postmaster postgres [.] SearchCatCache 1.79% postmaster postgres [.] LWLockAcquire 1.18% postmaster libc-2.12.so [.] memcpy 1.12% postmaster postgres [.] mdnblocks Issue starts: at 19:43 Mon Mar 16 19:43:04 MSK 2015 Stream: MASTER-rdb04d:70837172337784 SLAVE:70837172314864 Replay:70837172316512 :: REPLAY 21 KBytes (00:00:00.006225 seconds) Mon Mar 16 19:43:09 MSK 2015 Stream: MASTER-rdb04d:70837177455624 SLAVE:70837177390968 Replay:70837176794376 :: REPLAY 646 KBytes (00:00:00.367305 seconds) Mon Mar 16 19:43:14 MSK 2015 Stream: MASTER-rdb04d:70837185005120 SLAVE:70837184961280 Replay:70837183253896 :: REPLAY 1710 KBytes (00:00:00.827881 seconds) Mon Mar 16 19:43:19 MSK 2015 Stream: MASTER-rdb04d:70837190417984 SLAVE:70837190230232 Replay:70837183253896 :: REPLAY 6996 KBytes (00:00:05.873169 seconds) Mon Mar 16 19:43:24 MSK 2015 Stream: MASTER-rdb04d:70837198538232 SLAVE:70837198485000 Replay:70837183253896 :: REPLAY 14926 KBytes (00:00:11.025561 seconds) Mon Mar 16 19:43:29 MSK 2015 Stream: MASTER-rdb04d:70837209961192 SLAVE:70837209869384 Replay:70837183253896 :: REPLAY 26081 KBytes (00:00:16.068014 seconds) We see [k] copy_user_generic_string 12.90% postmaster [kernel.kallsyms] [k] copy_user_generic_string 11.49% postmaster postgres [.] 0x00000000001f40c1 4.74% postmaster postgres [.] hash_search_with_hash_value 1.86% postmaster postgres [.] mdnblocks 1.73% postmaster postgres [.] LWLockAcquire 1.67% postmaster postgres [.] SearchCatCache 25.71% postmaster [kernel.kallsyms] [k] copy_user_generic_string 7.89% postmaster postgres [.] hash_search_with_hash_value 4.66% postmaster postgres [.] 0x00000000002108da 4.51% postmaster postgres [.] mdnblocks 3.36% postmaster [kernel.kallsyms] [k] put_page Issue stops: at 19:51:39 Mon Mar 16 19:51:24 MSK 2015 Stream: MASTER-rdb04d:70838904179344 SLAVE:70838903934392 Replay:70837183253896 :: REPLAY 1680591 KBytes (00:08:10.384679 seconds) Mon Mar 16 19:51:29 MSK 2015 Stream: MASTER-rdb04d:70838929994336 SLAVE:70838929873624 Replay:70837183253896 :: REPLAY 1705801 KBytes (00:08:15.428773 seconds) Mon Mar 16 19:51:34 MSK 2015 Stream: MASTER-rdb04d:70838951993624 SLAVE:70838951899768 Replay:70837183253896 :: REPLAY 1727285 KBytes (00:08:20.472567 seconds) Mon Mar 16 19:51:39 MSK 2015 Stream: MASTER-rdb04d:70838975297912 SLAVE:70838975180384 Replay:70837208050872 :: REPLAY 1725827 KBytes (00:08:10.256935 seconds) Mon Mar 16 19:51:44 MSK 2015 Stream: MASTER-rdb04d:70839001502160 SLAVE:70839001412616 Replay:70837260116984 :: REPLAY 1700572 KBytes (00:07:49.849511 seconds) Mon Mar 16 19:51:49 MSK 2015 Stream: MASTER-rdb04d:70839022866760 SLAVE:70839022751184 Replay:70837276732880 :: REPLAY 1705209 KBytes (00:07:42.307364 seconds) And copy_user_generic_string goes down + 13.43% postmaster postgres [.] 0x000000000023dc9a + 3.71% postmaster [kernel.kallsyms] [k] copy_user_generic_string + 2.46% init [kernel.kallsyms] [k] intel_idle + 2.30% postmaster postgres [.] hash_search_with_hash_value + 2.01% postmaster postgres [.] SearchCatCache Could you clarify what types of traces did you mean? GDB? To calculate slave and apply lag I use the following query at the replica site slave_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(), '0/0') AS receive" $p_db) replay_lag=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/0') AS replay" $p_db) replay_timediff=$($psql -U monitor -h$s_host -p$s_port -A -t -c "SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_delay" $p_db) master_lag=$($psql -U monitor -h$p_host -p$p_port -A -t -c "SELECT pg_xlog_location_diff(pg_current_xlog_location(), '0/0') AS offset" $p_db) echo "$(date) Stream: MASTER-$p_host:$master_lag SLAVE:$slave_lag Replay:$replay_lag :: REPLAY $(bc <<< $master_lag/1024-$replay_lag/1024) KBytes (${replay_timediff} seconds)" - Best regards, Sergey Shchukin |