Re: Startup process thrashing

Greg Smith <gsmith@xxxxxxxxxxxxx> · Thu, 11 Dec 2008 02:02:59 -0500 (EST)

On Thu, 11 Dec 2008, Phillip Berry wrote:

I'm not running PITR and checkpoint_segments is set to 100 as this is 
home to a very write intensive app.

That's weird then.  It shouldn't ever keep around more than 201 WAL 
segments.  I've heard one report of a similarly mysterious excess of them, 
from Robert Treat, but that was probably caused by a hardware failure.

1. Will copying the pg_xlog dir onto a larger partition cause any 
problems?

Ah, I didn't realize the disk it's on was still 100% full from what you 
said.  You're not going to get anywhere until that's sorted out.  You 
absolutely can move them all to another disk, you just can't delete them 
altogether or split them across logical disks.  The usual technique is 
something like this:

cd $PGDATA
mv pg_xlog $BIGGER
ln -s $BIGGER/pg_xlog

And then fire the server up in single-user mode to issue the VACUUM it's 
begging for.  Once it's back to running in regular mode again, issue a 
"CHECKPOINT" command, and it should erase most of those WAL files.  Then 
you could stop the server and reverse the above:

cd $PGDATA
unlink pg_xlog
mv $BIGGER/pg_xlog .

While filling that disk and grinding to a halt is bad, it's not 
necessarily any worse then the alternative if you didn't have it on a 
dedicated filesystem--that this out of control xlog creation might create 
an even greater backlog of segments.

2. Is there any reason that vacuuming would create so many WAL files? 
I've looked around and can't find any explaination, though it's a 
problem that seems to have happened before.

Nothing should make that many WAL files, normally the only way you can get 
into that position is if you're running PITR recovery but not archiving 
the segments--so they just keep piling up but the server can't recycle 
them.  What's supposed to happen is that every time there is a checkpoint, 
it deletes anything beyond 2*checkpoint_segments+1 worth of segments, 
that's where the 201 figure I mentioned comes from.

3. I installed pgbouncer yesterday for connection pooling, does the fact 
that it holds connections open have any bearing on how rapidly the 
transaction IDs are used up? As in are transaction IDs in anyway related 
to connections? It's a pretty interesting coincidence.

I'm not really familiar with pgbouncer to comment on the specifics of what 
it does.  In general, keeping transactions open longer will impact the 
pg_clog section of the database, but that shouldn't cause the WAL to grow 
out of control.  Once you get this back into production, you certainly 
should carefully monitor how much disk space is being taken up on the WAL 
disk moving forward.  If the count of them there gets back over 201 again, 
that's an indication there's something that shouldn't be happening going 
on.  That's the sort of thing you can get more information about

This isn't a real popular time of day for this list, some get some more 
ideas might show up once our many European members start posting in 
volume.  As a general aside, if you ever find yourself in this position 
again, where you've got an urgent database problem, something you might do 
in parallel with posting here is trying the IRC channel: 
http://wiki.postgresql.org/wiki/IRC2RWNames

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general