Re: [Slony1-general] WAL partition overloaded--by autovacuum?

Greg Smith <greg@xxxxxxxxxxxxxxx> · Sat, 10 Jul 2010 00:23:53 +0100

Richard Yen wrote:
I figured that pg_xlog and data/base could both be on the FusionIO drive, since there would be no latency when there are no spindles.

(Rolls eyes) Please be careful about how much SSD Kool-Aid you drink, 
and be skeptical of vendor claims. SSDs don't just make latency go away, 
particularly on heavy write workloads where the technology is at its 
weakest.

Also, random note, I'm seeing way too many FusionIO drive setups where 
people don't have any redundancy to cope with a drive failure, because 
the individual drives are so expensive they don't have more than one. 
Make sure that if you lose one of the drives, you won't suffer massive 
data loss. Replication might help with that, if you can stand a little 
bit of data loss when the SSD dies. Not if--when. Even if you have a 
good one they don't last forever.

This means my pg_xlog partition should be (2 + checkpoint_completion_target) * checkpoint_segments + 1 = 41 files, or 656MB.  Then, if there are more than 49 files, unneeded segment files will be deleted, but in this case all segment files are needed, so they never got deleted.  Perhaps we should add in the docs that pg_xlog should be the size of the DB or larger?

Excessive write volume beyond the capacity of the hardware can end up 
delaying the normal checkpoint that would have cleaned up all the xlog 
files. There's a nasty spiral this can get into, which I've seen a 
couple of times in a form similar to what you reported. The pg_xlog 
directory should never exceed the size computed by that formula for very 
long, but it can burst above its normal size limits for a little bit. 
This is already mentioned as a possibility in the manual: "If, due to a 
short-term peak of log 
output rate, there are more than 3 * checkpoint_segments + 1 segment 
files, the unneeded segment files will be deleted instead of recycled 
until the system gets back under this limit." Autovacuum is an easy way 
to get the sort of activity needed to cause this problem, but I don't 
know if it's a necessary component to see the problem. You have to be in 
an unusual situation before the sum of the xlog files is anywhere close 
to the size of the database though.
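
To put numbers on the two limits, here is a rough sketch of the 
arithmetic in Python. It assumes the default 16MB segment size, and the 
settings checkpoint_segments = 16 and checkpoint_completion_target = 0.5 
are inferred from the 41 and 49 file counts quoted above rather than 
taken from an actual postgresql.conf. The function names are made up for 
illustration.

```python
SEGMENT_MB = 16  # assumed: default WAL segment size in MB


def normal_max_segments(checkpoint_segments, checkpoint_completion_target):
    """Normal upper bound on segment files in pg_xlog, per the formula
    (2 + checkpoint_completion_target) * checkpoint_segments + 1."""
    return int((2 + checkpoint_completion_target) * checkpoint_segments) + 1


def recycle_limit_segments(checkpoint_segments):
    """Count above which unneeded segments are deleted instead of
    recycled: 3 * checkpoint_segments + 1."""
    return 3 * checkpoint_segments + 1


# Settings inferred from the figures in the quoted message (assumption).
normal = normal_max_segments(16, 0.5)
limit = recycle_limit_segments(16)

print(normal, "files,", normal * SEGMENT_MB, "MB")  # 41 files, 656 MB
print(limit, "files,", limit * SEGMENT_MB, "MB")    # 49 files, 784 MB
```

Neither number is a hard cap. If the checkpoint that would release the 
old segments is delayed, every segment is still needed and the directory 
keeps growing past both figures until the checkpoint finishes.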

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@xxxxxxxxxxxxxxx   www.2ndQuadrant.us

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)