On Fri, Nov 16, 2012 at 9:52 AM, Vlad <marchenko@xxxxxxxxx> wrote:
> Merlin,
>
>> Yeah -- you're right, this is definitely a spinlock issue. Next steps:
>>
>> *) in mostly read workloads, we have a couple of known frequent
>> offenders. In particular the 'BufFreelistLock'. One way we can
>> influence that guy is to try and significantly lower/raise shared
>> buffers. So this is one thing to try.
>
> The server has 32GB, with 3.2GB dedicated to shared buffers. I've
> increased it to 13.2GB and the stall still happened (I have a way of
> controlling the number of queries hitting postgresql by shutting down
> our own memcache-based app cache, so to test for a stall I temporarily
> shut down a few memcached servers).
>
>> *) failing that, the LWLOCK_STATS macro can be compiled in to give us
>> some information about the particular lock(s) we're binding on.
>> Hopefully it's an lwlock -- this will make diagnosing the problem
>> easier.
>
> I've enabled that macro and am seeing lwlock messages flying by in the
> log (see below), even when no high-sys-cpu stall is observed at the
> moment. Should I be looking for something in particular?

We're looking for spikes in 'blk', which records how often a backend had
to block waiting on an lwlock. If you're not seeing any, that suggests a
buffer pin related issue -- which is also supported by the fact that
raising shared buffers didn't help. If you're not seeing 'blk's, go ahead
and disable the stats macro.

So, what we need to know now is:

*) What happens when you drastically *lower* shared buffers? Say, to
64MB? Note, you may experience higher load for unrelated reasons and
have to scuttle the test. Also, if you have to crank it higher to handle
internal server structures, do that. This is a hail mary, but maybe
something interesting will shake out.

*) How many distinct query plans are needed to reproduce the condition?
Hopefully it's not too many. If so, let's start gathering the plans. If
you have a lot of plans to sift through, one way to cut down the noise
is to tweak log_min_duration_statement so that during stall times (only)
it logs the queries that are unexpectedly blocking.

*) Approximately how big is your 'working set' -- the data your queries
are routinely hitting?

*) Is the distribution of the *types* of queries uniform? Or do you have
special processes that run on intervals?

(A few rough command sketches for the steps above are in the P.S. below.)

Thanks for your patience.

merlin
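P.S. Some rough shell sketches for the above -- treat them as starting
points, not gospel; the paths, thresholds, and the exact LWLOCK_STATS
line format are assumptions about your setup, so adjust as needed.

To pick out the lwlock stats lines that actually blocked (this assumes
'blk' is the last field on each line, which is how the older output
format prints it -- double-check against your log, and adjust the log
path to wherever your logging collector writes):

    # show only stats lines with a nonzero blk count
    grep 'lwlock' $PGDATA/pg_log/*.log | awk '$NF > 0' | tail -50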
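For the low-shared_buffers test, something along these lines, assuming
you edit postgresql.conf directly and run pg_ctl by hand (swap in your
init script or packaging tooling as appropriate):

    # drop shared_buffers for the test; this change needs a restart
    sed -i 's/^shared_buffers.*/shared_buffers = 64MB/' $PGDATA/postgresql.conf
    pg_ctl -D $PGDATA restart -m fast
    psql -c "SHOW shared_buffers;"    # confirm the new value took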
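To catch the offending queries only while a stall is in progress, you
can flip log_min_duration_statement on the fly -- it only needs a
reload, not a restart. The 500ms threshold below is just a guess; tune
to taste:

    # during a stall: log anything running longer than ~500ms
    sed -i 's/^#\?log_min_duration_statement.*/log_min_duration_statement = 500/' \
        $PGDATA/postgresql.conf
    psql -c "SELECT pg_reload_conf();"
    # when the stall passes, set it back to -1 and reload again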
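For the working-set and query-mix questions, the pg_buffercache and
pg_stat_statements contrib extensions can give rough numbers, assuming
you can install them on this box (pg_stat_statements has to be listed in
shared_preload_libraries, so enabling it costs a restart):

    # rough working-set gauge: how much of shared_buffers holds "hot"
    # pages (usagecount >= 3 is an arbitrary cutoff; 8192 = default block size)
    psql -c "SELECT count(*) * 8192 / 1024 / 1024 AS hot_mb
               FROM pg_buffercache WHERE usagecount >= 3;"

    # which statements dominate the workload
    psql -c "SELECT calls, round(total_time::numeric, 1) AS total_ms, query
               FROM pg_stat_statements ORDER BY calls DESC LIMIT 20;"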