yeah. ok, next steps:
*) can you confirm that postgres process is using high cpu (according
to top) during stall time?
Yes, CPU is spread across a lot of postmasters:
  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
29863 pgsql  20  0 3636m 102m  36m R 19.1  0.3 0:01.33 postmaster
30277 pgsql  20  0 3645m 111m  37m R 16.8  0.3 0:01.27 postmaster
11966 pgsql  20  0 3568m  22m  15m R 15.1  0.1 0:00.66 postmaster
 8073 pgsql  20  0 3602m  60m  26m S 13.6  0.2 0:00.77 postmaster
29780 pgsql  20  0 3646m 115m  43m R 13.6  0.4 0:01.13 postmaster
11865 pgsql  20  0 3606m  61m  23m S 12.8  0.2 0:01.87 postmaster
29379 pgsql  20  0 3603m  70m  30m R 12.8  0.2 0:00.80 postmaster
29727 pgsql  20  0 3616m  77m  31m R 12.5  0.2 0:00.81 postmaster
*) if so, please strace that process and save some of the log
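For reference, a minimal sketch of how the busiest backends could be picked out of top output and handed to strace. The helper name busy_postmasters, the CPU threshold, and the log path are hypothetical; the field positions assume the default top column layout (PID first, %CPU ninth, COMMAND twelfth).

```shell
#!/bin/sh
# Hypothetical helper: read top-style lines on stdin and print the PIDs of
# postmaster processes whose %CPU (field 9) is at or above the threshold.
busy_postmasters() {
    threshold="${1:-10}"   # default threshold of 10% is an arbitrary choice
    awk -v t="$threshold" '$12 == "postmaster" && $9 + 0 >= t { print $1 }'
}

# Usage sketch, left commented out because it attaches to live processes.
# It assumes top, strace and timeout are installed and that the caller is
# allowed to trace the backends. Each trace runs for 10 seconds:
#
# top -b -n 1 | busy_postmasters 15 | while read -r pid; do
#     timeout 10 strace -tt -T -p "$pid" -o "/tmp/strace.$pid.log"
# done
```

The -tt and -T flags add wall-clock timestamps and per-syscall durations, which should make it easier to see where a backend sits during a stall.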
*) you're using a 'bleeding edge' kernel. so we must be suspicious of
a regression there, particularly in the scheduler.
This has been observed for a while, during which the system went from 3.4.* kernels to 3.5.*, but I do not rule out such a possibility.
*) I am suspicious of a spinlock issue. so, if we can't isolate the
problem, is running a hand compiled postgres a possibility (for lock
stats)?
Yes, definitely possible. We run a manually compiled PostgreSQL anyway. Please provide instructions.
*) what is the output of this:
echo /proc/sys/vm/zone_reclaim_mode
I presume you wanted cat instead of echo, and it shows 0.
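The difference matters because echo only prints the path string itself, while cat prints the file contents. A small sketch of the check, assuming a Linux /proc filesystem; the helper name read_vm_setting is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the value stored in a sysctl-style file, or
# "unavailable" if the file cannot be read. The path defaults to the
# zone_reclaim_mode knob discussed above.
read_vm_setting() {
    path="${1:-/proc/sys/vm/zone_reclaim_mode}"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "unavailable"
    fi
}

# Prints the current setting on this machine.
read_vm_setting
```

A value of 0 means zone reclaim is off. Running sysctl vm.zone_reclaim_mode should report the same setting.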
-- vlad