On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad <marchenko@xxxxxxxxx> wrote:
>>
>>> yeah. ok, next steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>> yes, CPU is spread across a lot of postmasters
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 29863 pgsql     20   0 3636m 102m  36m R 19.1  0.3   0:01.33 postmaster
>> 30277 pgsql     20   0 3645m 111m  37m R 16.8  0.3   0:01.27 postmaster
>> 11966 pgsql     20   0 3568m  22m  15m R 15.1  0.1   0:00.66 postmaster
>>  8073 pgsql     20   0 3602m  60m  26m S 13.6  0.2   0:00.77 postmaster
>> 29780 pgsql     20   0 3646m 115m  43m R 13.6  0.4   0:01.13 postmaster
>> 11865 pgsql     20   0 3606m  61m  23m S 12.8  0.2   0:01.87 postmaster
>> 29379 pgsql     20   0 3603m  70m  30m R 12.8  0.2   0:00.80 postmaster
>> 29727 pgsql     20   0 3616m  77m  31m R 12.5  0.2   0:00.81 postmaster
>>
>>> *) if so, please strace that process and save some of the log
>>
>> https://dl.dropbox.com/u/109778/stall_postmaster.log
>
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8, "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 3000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 6000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 7000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
> select(0, NULL, NULL, NULL, {0, 9000}) = 0 (Timeout)
> semop(41713721, {{2, 1, 0}}, 1) = 0
> lseek(295, 0, SEEK_END) = 0
> lseek(296, 0, SEEK_END) = 8192
>
> this is definitely pointing to a spinlock issue. see: slock.c  what you
> are seeing here is a backend getting caught in a spin loop via a stuck
> spinlock. 0-9 = 10 times. did you by any chance check the logs? did
> any backends restart? we are looking for this:
>
>     elog(PANIC, "stuck spinlock (%p) detected at %s:%d",
>          lock, file, line);
>
> Anything else going on in the log? The way to enable lock status is
> via the LWLOCK_STATS macro. But before doing anything plz check the
> logs for error/panic.

ah, scratch that. reading the code again, slock.c allows for up to 1
second (that is, you can delay 1000 times), so the panic is unlikely
to be in the log. also, spinlock waits shouldn't increase by exactly
1000 us. investigating...

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
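
For reference, the bounded spin-with-backoff pattern being discussed above
(the growing select()/sleep timeouts in the strace, and a "stuck spinlock"
panic after a fixed number of delays) looks roughly like the sketch below.
This is a minimal illustration, not the actual PostgreSQL s_lock.c source;
the function name spin_lock(), the constants, and the tas() stand-in are
assumptions made for the example.

/*
 * Minimal sketch of a bounded spin-with-backoff loop: spin on a
 * test-and-set lock, sleep with a randomly growing delay while it stays
 * busy (this sleeping is what shows up as timed select() calls in the
 * strace), and give up with a "stuck spinlock" panic after a fixed
 * number of sleeps.  Names and constants are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MIN_DELAY_USEC  1000        /* first sleep: 1 ms */
#define MAX_DELAY_USEC  1000000     /* cap any single sleep at 1 s */
#define NUM_DELAYS      1000        /* panic after this many sleeps */

/* stand-in for the platform test-and-set primitive: returns 0 if acquired */
static int
tas(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

static void
spin_lock(volatile int *lock, const char *file, int line)
{
    int cur_delay = 0;
    int delays = 0;

    while (tas(lock))
    {
        if (cur_delay == 0)
            cur_delay = MIN_DELAY_USEC;

        if (++delays > NUM_DELAYS)
        {
            /* the case Merlin quotes: elog(PANIC, "stuck spinlock ...") */
            fprintf(stderr, "stuck spinlock (%p) detected at %s:%d\n",
                    (void *) lock, file, line);
            abort();
        }

        usleep(cur_delay);          /* appears as a timed sleep in strace */

        /* grow the delay by a random fraction of itself, up to the cap */
        cur_delay += (int) (cur_delay * ((double) rand() / RAND_MAX) + 0.5);
        if (cur_delay > MAX_DELAY_USEC)
            cur_delay = MAX_DELAY_USEC;
    }
}

int
main(void)
{
    volatile int lock = 0;          /* free lock: acquired on the first tas() */

    spin_lock(&lock, __FILE__, __LINE__);
    printf("lock acquired without contention\n");
    return 0;
}

The random growth step is why consecutive sleeps in the strace increase by
uneven amounts (1, 1, 1, 2, 3, 4, ... ms) rather than by a fixed increment.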