things to check:
*) blocked queries (pg_locks/pg_stat_activity)
nada
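(for anyone repeating this check, a minimal sketch of the lock-wait query; assumes 9.2+ column names and a placeholder database name 'mydb')

    # sessions waiting on a lock, and what they are running
    psql -d mydb -c "
      SELECT l.pid, l.locktype, l.mode, a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
      WHERE NOT l.granted;"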
*) i/o wait. in particular, is the linux page cache flushing?
no i/o wait, and no IRQ problems (see the iostat/vmstat output below)
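(to rule out writeback pressure directly, a sketch that samples the kernel's dirty/writeback counters once a second; these live in /proc/meminfo on any modern kernel)

    # watch page cache writeback activity around a stall
    while true; do
        grep -E '^(Dirty|Writeback):' /proc/meminfo
        sleep 1
    done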
*) query stampede: we will want to measure TPS leading into the stall
and out of it.
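(a sketch for that, sampling the cumulative commit counter from pg_stat_database; 'mydb' is a placeholder, and the first sample is garbage since the counter is cumulative)

    # rough committed-transactions-per-second, printed every second
    prev=0
    while true; do
        cur=$(psql -At -d mydb -c \
            "SELECT xact_commit FROM pg_stat_database WHERE datname = 'mydb'")
        echo "$(date +%T) tps=$((cur - prev))"
        prev=$cur
        sleep 1
    done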
*) anything else running on the box?
just bare linux + postgresql.
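(for completeness, a one-liner to double-check; assumes GNU ps)

    # anything else eating cpu?
    ps -eo pid,pcpu,comm --sort=-pcpu | head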
*) do you have a large number of tables? maybe it's related to the statistics file?
over 1000 tables across 4 or 5 schemas in a single database.
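(with that many tables the stats collector file can get big; a sketch to check, assuming the default stats_temp_directory under $PGDATA)

    # how many relations, and how big is the stats collector's temp file?
    psql -At -c "SELECT count(*) FROM pg_class WHERE relkind = 'r';"
    ls -lh "$PGDATA"/pg_stat_tmp/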
*) let's log checkpoints to see if there is a correlation with the stalls
checked; checkpoints happen much more rarely and with no relation to the high-sys periods
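(for reference, checkpoint logging is just a reload away; sketch assuming access to postgresql.conf)

    # in postgresql.conf:
    #   log_checkpoints = on
    pg_ctl reload -D "$PGDATA"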
*) nice to have vmstat/iostat before/during/after the stall. for example, a
massive spike of context switches during the stall could point to an o/s
scheduler issue.
checked that as well - nothing. context switches are even lower during the
stall:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.94    0.00    9.28    0.38    0.00   73.40

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.00        48.00         0.00         48          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          18.06    0.00   18.43    0.25    0.00   63.26

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              10.00       104.00         0.00        104          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.12    0.00   86.74    0.12    0.00    4.01

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.97         7.77         0.00          8          0

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.44    0.00   96.58    0.00    0.00    1.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.28        78.69         0.00        144          0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff    cache   si   so    bi    bo     in    cs us sy id wa st
 1  0      0 279240  12016 14431964    0    0    32     0 197852  4299 15  9 76  0  0
 4  0      0 225984  12024 14419696    0    0     0    64 197711  5158 11  9 79  1  0
 0  0      0 260112  12024 14413636    0    0    48     0 196708  4618 17 10 73  0  0
 6  0      0 233936  12024 14375784    0    0   104     0 179861  4884 19 17 64  0  0
30  0      0 224904  12024 14354812    0    0     8     0  51088  1205  9 86  5  0  0
72  0      0 239144  12024 14333852    0    0   144     0  45601   542  2 98  0  0  0
78  0      0 224840  12024 14328536    0    0     0     0  38732   481  2 94  5  0  0
22  1      0 219072  12032 14250652    0    0   136   100  47323  1231  9 90  1  0  0
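(the run queue blowing up to 70+ while context switches and interrupts collapse suggests the time is being burnt inside the kernel; a sketch of a possible next step to attribute that %sys, assuming perf is installed)

    # profile the whole box for 10s during a stall, then see which
    # kernel/user symbols the cycles land in
    perf record -a -g -- sleep 10
    perf report --sort symbol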