Re: Spurious Stalls

Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> · Sat, 14 Jun 2014 00:26:39 +0100

Hi Bill,

On 13 June 2014 20:35, Bill Moran <wmoran@xxxxxxxxxxxxxxxxx> wrote:
> On Fri, 13 Jun 2014 20:02:01 +0100 Jaco Engelbrecht <jengelbrecht@xxxxxxxxxxxxx> wrote:
>>
>> This coincides with a checkpoint:
>
> There's a lot going on here (as always with a busy server) but I suspected
> a checkpoint problem earlier, and this statement and your additional
> information makes me suspect even harder.
>
> In your earlier email with the .conf file, I seem to remember that
> you had checkpoint_segments set to 256 and checkpoint_timeout set to
> 30m.  It's obvious from reading this thread that you've already put
> quite a bit of effort into resolving this.  My question: have you
> tried _lowering_ the checkpoint settings?  If we assume that the
> stall is related to checkpoint, and I'm remembering correctly on the
> settings, then PostgreSQL might have as much as 4G of wal logs to
> grind through to complete a checkpoint.  While not huge, if that's
> trying to complete at the same time a lot of other work is going on,
> it could cause stalls.  If you lower the chckpoint_segments and
> checkpoint_timeout, it will cause _more_ disk activity overall, but
> it will be spread out more.  Whether or not this helps with your
> particular situation is dependent on whether your incidents are
> caused by a spike in activity (in which case it might help) or
> a cumulative effect of a lot of activity (in which case it will
> probably make the situation worse).

Thanks, we're going to look into that.

I checked our revision history to see if we changed any checkpoint
settings over the last year and we have not, however what I did notice
was that a few days before we first experienced this issue we
increased the wal_keep_segments from 256 to 1024 (and then later
further to 1536) in order to keep enough WAL records around for our
backups.

Sure enough, I just found a post at
http://www.postgresql.org/message-id/CAPVp=gbKVbNr1zQM_LKauNY-U1PHB++y=Xq26K-dXdDsffv_PQ@xxxxxxxxxxxxxx
describing a similar issue related to having wal_keep_segments set to
1024 (with a much lower checkpoint_segments set - 32) but no
resolution on the list.

Any thoughts on the wal_keep_segments we have set to 1024 currently?

> Another thing that I may be misremembering from from your earlier
> email: did you say that the load on the database was mostly write
> (or am I misremembering that you said the OS graphs were showing
> mostly write?)  The reason I'm asking is that we've seen problems
> like you describe when trying to implement a high volume queue
> in PostgreSQL: the continuous INSERT/SELECT/DELETE grind on the
> single queue table was just more than PostgreSQL could keep up
> with.  We moved that one portion of the application to Redis and
> everything else just fell in line.  I'm stretching a bit to suppose
> that you have a similar problem, but it's another data point for
> you to consider.

Yes, mostly writes.  We already use Redis for some aspects of the
site, but we'll look into what else we could move there.

Jaco