The time it takes pg_dump to run has grown from 1 hour to 3 and even 4 hours over the last 6 months, with more than half of that increase occurring since we upgraded to 9.0.x. In the last several weeks (possibly since the upgrade to 9.0.4), we are seeing all connections get used up during the backup. Our main apps use connection pools, but monitoring and some utilities make a direct connection for each query, and some of them don't wait for the prior query to complete before sending another, which slowly eats up the available connections. Even the pooled apps stop functioning during the backup, however, as all of their connections wind up stuck in the "parse waiting" state. I also see lots of sockets sitting in CLOSE_WAIT, apparently indefinitely, while the backup is running and all connections are used up.

I assume all of this is the result of pg_dump starting a transaction or otherwise blocking other access. I can move everything onto a pool; that isn't a hard problem to solve, but it won't fix the fundamental problem of no queries being able to finish while the backup is happening.
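For reference, a query along the lines of the one below should show which backend is blocking which while the dump runs. This is only a rough sketch, and the column names are the 9.0 ones (procpid and current_query):

    -- Rough sketch: list ungranted lock requests alongside the backend
    -- that currently holds a conflicting lock on the same relation.
    SELECT waiting.pid        AS waiting_pid,
           wa.current_query   AS waiting_query,
           waiting.mode       AS requested_mode,
           holder.pid         AS holding_pid,
           ha.current_query   AS holding_query,
           holder.mode        AS held_mode
    FROM pg_locks waiting
    JOIN pg_stat_activity wa ON wa.procpid = waiting.pid
    JOIN pg_locks holder     ON holder.locktype = waiting.locktype
                            AND holder.database IS NOT DISTINCT FROM waiting.database
                            AND holder.relation IS NOT DISTINCT FROM waiting.relation
                            AND holder.granted
                            AND holder.pid <> waiting.pid
    JOIN pg_stat_activity ha ON ha.procpid = holder.pid
    WHERE NOT waiting.granted;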
I know I'm not the only one running a database of this size. How do others handle backups? At the moment I don't have replication set up, but I could use the old hardware as a replica. It doesn't have quite the I/O capacity and nowhere near as much RAM, but I wouldn't be looking to use it for queries unless I lost the primary, and it is definitely capable of handling the insert load, at least when the inserts are done directly; I'm not sure whether handling the same inserts via streaming replication is easier or harder. My question is: what are the performance repercussions of running a pg_dump backup off the replicated server? If it experiences the same kind of lockup, will SR fall so far behind that it can't catch up? Is there some other preferred way to back up a large database?
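For context, my (possibly wrong) understanding of the 9.0 knobs involved is roughly the following; the settings below are a sketch of what I think a dump-friendly standby would need, not a tested configuration:

    # postgresql.conf on the standby -- sketch, not a tested configuration
    hot_standby = on                    # allow read-only connections, so pg_dump can run here
    max_standby_streaming_delay = -1    # let a long dump pause WAL replay instead of being
                                        # cancelled; the standby falls behind and catches up later
    max_standby_archive_delay = -1      # same, for WAL restored from the archive

    # postgresql.conf on the primary
    wal_level = hot_standby             # required for a hot standby
    max_wal_senders = 1                 # at least one slot for the streaming connection

If that understanding is right, the trade-off is that the dump never gets cancelled, but replay on the standby stalls for as long as the dump takes, which is exactly my "can SR catch up" question above.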
And finally, is the lockout I'm experiencing actually the result of a bug, or am I misusing pg_dump in some way?