On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@xxxxxxxxx> wrote:
> Merlin,
>
>> Yeah -- you're right, this is definitely a spinlock issue. Next steps:
>>
>> *) in mostly read workloads, we have a couple of known frequent
>> offenders. In particular the 'BufFreelistLock'. One way we can
>> influence that guy is to try and significantly lower/raise shared
>> buffers. So this is one thing to try.
>
> The server has 32GB, with 3.2GB dedicated to shared buffers. I've
> increased it to 13.2GB and the stall still happened (I have a way of
> controlling the number of queries hitting postgresql by shutting down
> our own memcache-based app cache, so to test for a stall I temporarily
> shut down a few memcached servers).
>
>> *) failing that, the LWLOCK_STATS macro can be compiled in to give us
>> some information about the particular lock(s) we're binding on.
>> Hopefully it's an lwlock -- this will make diagnosing the problem
>> easier.
>
> I've enabled that macro and am seeing lwlock messages flying by in the
> log (see below), even when no high-sys-cpu stall is observed at the
> moment. Should I be looking for something in particular?

We're looking for spikes in 'blk', which records how often a backend had
to block waiting on an lwlock. If you're not seeing any, that suggests a
buffer pin related issue -- which is also supported by the fact that
raising shared buffers didn't help. If you're not seeing 'blk's, go ahead
and disable the stats macro.

So, what we need to know now is:

*) What happens when you drastically *lower* shared buffers? Say, to
64MB? Note, you may experience higher load for unrelated reasons and
have to scuttle the test. Also, if you have to crank it higher to handle
internal server structures, do that. This is a hail mary, but maybe
something interesting will shake out.

*) How many distinct query plans are needed to reproduce the condition?
Hopefully it's not too many. If so, let's start gathering the plans. If
you have a lot of plans to sift through, one way to cut down the noise
is to tweak log_min_duration_statement so that during stall times (only)
it logs the queries that are unexpectedly blocking.

*) Approximately how big is your 'working set' -- the data your queries
are routinely hitting?

*) Is the distribution of the *types* of queries uniform? Or do you have
special processes that run on intervals?

(A few rough command sketches for the steps above are in the P.S. below.)

Thanks for your patience.

merlin
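P.S. Some rough shell sketches for the above -- treat them as starting
points, not gospel; the paths, thresholds, and the exact LWLOCK_STATS
line format are assumptions about your setup, so adjust as needed.

To pick out the lwlock stats lines that actually blocked (this assumes
'blk' is the last field on each line, which is how the older output
format prints it -- double-check against your log, and adjust the log
path to wherever your logging collector writes):

    # show only stats lines with a nonzero blk count
    grep 'lwlock' $PGDATA/pg_log/*.log | awk '$NF > 0' | tail -50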
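For the low-shared_buffers test, something along these lines, assuming
you edit postgresql.conf directly and run pg_ctl by hand (swap in your
init script or packaging tooling as appropriate):

    # drop shared_buffers for the test; this change needs a restart
    sed -i 's/^shared_buffers.*/shared_buffers = 64MB/' $PGDATA/postgresql.conf
    pg_ctl -D $PGDATA restart -m fast
    psql -c "SHOW shared_buffers;"    # confirm the new value took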
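To catch the offending queries only while a stall is in progress, you
can flip log_min_duration_statement on the fly -- it only needs a
reload, not a restart. The 500ms threshold below is just a guess; tune
to taste:

    # during a stall: log anything running longer than ~500ms
    sed -i 's/^#\?log_min_duration_statement.*/log_min_duration_statement = 500/' \
        $PGDATA/postgresql.conf
    psql -c "SELECT pg_reload_conf();"
    # when the stall passes, set it back to -1 and reload again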
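For the working-set and query-mix questions, the pg_buffercache and
pg_stat_statements contrib extensions can give rough numbers, assuming
you can install them on this box (pg_stat_statements has to be listed in
shared_preload_libraries, so enabling it costs a restart):

    # rough working-set gauge: how much of shared_buffers holds "hot"
    # pages (usagecount >= 3 is an arbitrary cutoff; 8192 = default block size)
    psql -c "SELECT count(*) * 8192 / 1024 / 1024 AS hot_mb
               FROM pg_buffercache WHERE usagecount >= 3;"

    # which statements dominate the workload
    psql -c "SELECT calls, round(total_time::numeric, 1) AS total_ms, query
               FROM pg_stat_statements ORDER BY calls DESC LIMIT 20;"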