Hi Bill, On 13 June 2014 20:35, Bill Moran <wmoran@xxxxxxxxxxxxxxxxx> wrote: > On Fri, 13 Jun 2014 20:02:01 +0100 Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> wrote: >> >> This coincides with a checkpoint: > > There's a lot going on here (as always with a busy server) but I suspected > a checkpoint problem earlier, and this statement and your additional > information makes me suspect even harder. > > In your earlier email with the .conf file, I seem to remember that > you had checkpoint_segments set to 256 and checkpoint_timeout set to > 30m. It's obvious from reading this thread that you've already put > quite a bit of effort into resolving this. My question: have you > tried _lowering_ the checkpoint settings? If we assume that the > stall is related to checkpoint, and I'm remembering correctly on the > settings, then PostgreSQL might have as much as 4G of wal logs to > grind through to complete a checkpoint. While not huge, if that's > trying to complete at the same time a lot of other work is going on, > it could cause stalls. If you lower the chckpoint_segments and > checkpoint_timeout, it will cause _more_ disk activity overall, but > it will be spread out more. Whether or not this helps with your > particular situation is dependent on whether your incidents are > caused by a spike in activity (in which case it might help) or > a cumulative effect of a lot of activity (in which case it will > probably make the situation worse). Thanks, we're going to look into that. I checked our revision history to see if we changed any checkpoint settings over the last year and we have not, however what I did notice was that a few days before we first experienced this issue we increased the wal_keep_segments from 256 to 1024 (and then later further to 1536) in order to keep enough WAL records around for our backups. Sure enough, I just found a post at http://www.postgresql.org/message-id/CAPVp=gbKVbNr1zQM_LKauNY-U1PHB++y=Xq26K-dXdDsffv_PQ@xxxxxxxxxxxxxx describing a similar issue related to having wal_keep_segments set to 1024 (with a much lower checkpoint_segments set - 32) but no resolution on the list. Any thoughts on the wal_keep_segments we have set to 1024 currently? > Another thing that I may be misremembering from from your earlier > email: did you say that the load on the database was mostly write > (or am I misremembering that you said the OS graphs were showing > mostly write?) The reason I'm asking is that we've seen problems > like you describe when trying to implement a high volume queue > in PostgreSQL: the continuous INSERT/SELECT/DELETE grind on the > single queue table was just more than PostgreSQL could keep up > with. We moved that one portion of the application to Redis and > everything else just fell in line. I'm stretching a bit to suppose > that you have a similar problem, but it's another data point for > you to consider. Yes, mostly writes. We already use Redis for some aspects of the site, but we'll look into what else we could move there. Jaco