Re: Trimming transaction logs after extended WAL archive failures

Steven Schlansker <steven@xxxxxxxxxxxx> · Wed, 26 Mar 2014 09:44:05 -0700

On Mar 26, 2014, at 9:04 AM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

> On Tue, Mar 25, 2014 at 6:33 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
> On Tuesday, March 25, 2014, Steven Schlansker <steven@xxxxxxxxxxxx> wrote:
> Hi everyone,
> 
> I have a Postgres 9.3.3 database machine.  Due to some intelligent work on the part of someone who shall remain nameless, the WAL archive command included a ‘> /dev/null 2>&1’ which masked archive failures until the disk entirely filled with 400GB of pg_xlog entries.
> 
> PostgreSQL itself should be logging archive failures to the server log, regardless of whether the archive command's own output is discarded.
> 
> 
> I have fixed the archive command and can see WAL segments being shipped off of the server, however the xlog remains at a stable size and is not shrinking.  In fact, it’s still growing at a (much slower) rate.
> 
> The leading edge of the log files should be archived as soon as they fill up, and recycled/deleted two checkpoints later.  The trailing edge should be archived upon checkpoints and then recycled or deleted.  I think there is a throttle on how many of the trailing edge are archived each checkpoint.  So issue a bunch of "CHECKPOINT;" commands for a while and see if that clears it up.

Indeed, forcing a bunch of CHECKPOINTs started to get things moving again.

> 
> Actually my description is rather garbled: it mixes up what I saw when wal_keep_segments was lowered with what happens when recovering from a long-lasting archive failure.  Nevertheless, checkpoints are what provoke the removal of excess WAL files.  Are you logging checkpoints?  What do they say?  Also, what is in pg_xlog/archive_status?
>  

I do log checkpoints, but most of them recycle and don’t remove:
Mar 26 16:09:36 prd-db1a postgres[29161]: [221-1] db=,user= LOG:  checkpoint complete: wrote 177293 buffers (4.2%); 0 transaction log file(s) added, 0 removed, 56 recycled; write=539.838 s, sync=0.049 s, total=539.909 s; sync files=342, longest=0.015 s, average=0.000 s
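
For anyone who wants to track the added/removed/recycled counters across many checkpoints rather than eyeballing them, here is a minimal sketch in Python. It assumes the 9.x "checkpoint complete" wording shown in the line above; the regex and the function name parse_checkpoint are illustrative only, not part of PostgreSQL, and other versions word this message differently.

```python
import re

# Matches the buffer and WAL segment counters in a 9.x-style
# "checkpoint complete" log line (assumed wording, taken from the sample above).
CHECKPOINT_RE = re.compile(
    r"checkpoint complete: wrote (?P<buffers>\d+) buffers.*?"
    r"(?P<added>\d+) transaction log file\(s\) added, "
    r"(?P<removed>\d+) removed, (?P<recycled>\d+) recycled"
)


def parse_checkpoint(line):
    """Return the counters as a dict of ints, or None if the line does not match."""
    match = CHECKPOINT_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


if __name__ == "__main__":
    import sys

    # Read a server log on stdin and print the counters for each checkpoint.
    for log_line in sys.stdin:
        counters = parse_checkpoint(log_line)
        if counters is not None:
            print(counters)
```

Fed the log line above, the "removed" counter comes back as 0 and "recycled" as 56, which is the "recycle but don't remove" pattern described here.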

That said, after letting the db run / checkpoint / archive overnight, the xlog did indeed start to slowly shrink.  The pace at which it is shrinking is somewhat unsatisfying, but at least we are making progress now!
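
One way to watch the backlog drain is to count the marker files in pg_xlog/archive_status, where a ".ready" file means the segment is still waiting for the archiver and a ".done" file means it has been archived. The sketch below is only an illustration: the function name archive_status_counts is made up, and the directory path has to be supplied by the caller since it depends on where the data directory lives.

```python
import os
import sys


def archive_status_counts(status_dir):
    """Count archive_status marker files by suffix.

    ".ready" = segment waiting to be archived, ".done" = already archived.
    """
    counts = {"ready": 0, "done": 0}
    for name in os.listdir(status_dir):
        _, ext = os.path.splitext(name)
        key = ext.lstrip(".")
        if key in counts:
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Usage: python archive_backlog.py /path/to/data/pg_xlog/archive_status
    # (the script name and path are placeholders, not real defaults)
    print(archive_status_counts(sys.argv[1]))
```

Run periodically, a falling "ready" count suggests the archiver is catching up; the "done" segments are the ones later checkpoints can recycle or remove.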

I guess if I had just been patient I could have saved some mailing list traffic.  But patience is hard when your production database system is running at 0% free disk :)

Thanks everyone for the help. If the log continues to shrink, I should be out of the woods now.

Best,
Steven

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)