On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@xxxxxxxxx> wrote:
>>
>>> yeah. ok, next steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>> yes, CPU is spread across a lot of postmasters
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 29863 pgsql     20   0 3636m 102m  36m R 19.1  0.3   0:01.33 postmaster
>> 30277 pgsql     20   0 3645m 111m  37m R 16.8  0.3   0:01.27 postmaster
>> 11966 pgsql     20   0 3568m  22m  15m R 15.1  0.1   0:00.66 postmaster
>>  8073 pgsql     20   0 3602m  60m  26m S 13.6  0.2   0:00.77 postmaster
>> 29780 pgsql     20   0 3646m 115m  43m R 13.6  0.4   0:01.13 postmaster
>> 11865 pgsql     20   0 3606m  61m  23m S 12.8  0.2   0:01.87 postmaster
>> 29379 pgsql     20   0 3603m  70m  30m R 12.8  0.2   0:00.80 postmaster
>> 29727 pgsql     20   0 3616m  77m  31m R 12.5  0.2   0:00.81 postmaster
>>
>>> *) if so, please strace that process and save some of the log
>>
>> https://dl.dropbox.com/u/109778/stall_postmaster.log
>
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
> lseek(295, 0, SEEK_END) = 0
> lseek(296, 0, SEEK_END) = 8192
>
> this is definitely pointing to a spinlock issue. see: slock.c  what you
> are seeing here is a backend getting caught in a spin loop via a stuck
> spinlock. 0-9 = 10 times. did you by any chance check the logs? did
> any backends restart? we are looking for this:
>
>     elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
>          lock, file, line);
>
> Anything else going on in the log? The way to enable lock status is
> via the LWLOCK_STATS macro. But before doing anything plz check the
> logs for error/panic.

ah, scratch that. reading the code again, slock.c allows for up to 1
second (that is, you can delay 1000 times), so the panic is unlikely
to be in the log. also, spinlock waits shouldn't increase by exactly
1000 us. investigating...

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
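
For reference, the bounded spin-with-backoff pattern being discussed above
(the growing select()/sleep timeouts in the strace, and a "stuck spinlock"
panic after a fixed number of delays) looks roughly like the sketch below.
This is a minimal illustration, not the actual PostgreSQL s_lock.c source;
the function name spin_lock(), the constants, and the tas() stand-in are
assumptions made for the example.

/*
 * Minimal sketch of a bounded spin-with-backoff loop: spin on a
 * test-and-set lock, sleep with a randomly growing delay while it stays
 * busy (this sleeping is what shows up as timed select() calls in the
 * strace), and give up with a "stuck spinlock" panic after a fixed
 * number of sleeps.  Names and constants are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MIN_DELAY_USEC  1000        /* first sleep: 1 ms */
#define MAX_DELAY_USEC  1000000     /* cap any single sleep at 1 s */
#define NUM_DELAYS      1000        /* panic after this many sleeps */

/* stand-in for the platform test-and-set primitive: returns 0 if acquired */
static int
tas(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

static void
spin_lock(volatile int *lock, const char *file, int line)
{
    int cur_delay = 0;
    int delays = 0;

    while (tas(lock))
    {
        if (cur_delay == 0)
            cur_delay = MIN_DELAY_USEC;

        if (++delays > NUM_DELAYS)
        {
            /* the case Merlin quotes: elog(PANIC, "stuck spinlock ...") */
            fprintf(stderr, "stuck spinlock (%p) detected at %s:%d\n",
                    (void *) lock, file, line);
            abort();
        }

        usleep(cur_delay);          /* appears as a timed sleep in strace */

        /* grow the delay by a random fraction of itself, up to the cap */
        cur_delay += (int) (cur_delay * ((double) rand() / RAND_MAX) + 0.5);
        if (cur_delay > MAX_DELAY_USEC)
            cur_delay = MAX_DELAY_USEC;
    }
}

int
main(void)
{
    volatile int lock = 0;          /* free lock: acquired on the first tas() */

    spin_lock(&lock, __FILE__, __LINE__);
    printf("lock acquired without contention\n");
    return 0;
}

The random growth step is why consecutive sleeps in the strace increase by
uneven amounts (1, 1, 1, 2, 3, 4, ... ms) rather than by a fixed increment.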