On Thu, Oct 27, 2011 at 1:45 PM, Nicholson, Brad (Toronto, ON, CA) <bnicholson@xxxxxx> wrote:
>From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Samuel Gendler
>Sent: Thursday, October 27, 2011 12:47 PM
>To: pgsql-performance@xxxxxxxxxxxxxx
>Subject: backups blocking everything
>What is the I/O utilization like during the dump? I've seen this situation in the past and it was caused be excessively bloated tables causing I/O starvation while they are getting dumped.
>I've got a large mixed-used database, with the data warehouse side of things consisting of several tables at hundreds of millions of rows, plus a number of tables with tens of >millions. There is partitioning, but as the volume of data has risen, individual partitions have gotten quite large. Hardware is 2x4 core 2.0Ghz Xeon processors, 176GB of RAM, 4 drives in >raid 10 for WAL logs and 16 or 20 spindles for data, also in RAID 10. Total database size is currently 399GB - via pg_database_size(). It's also worth noting that we switched from 8.4 to >9.0.4 only about a month ago, and we were not seeing this problem on 8.4.x. The database is growing, but not at some kind of exponential rate. full backup, compressed, on the old hardware >was 6.3GB and took about 1:45:00 to be written. Recent backups are 8.3GB and taking 3 or 4 hours. We were not seeing al queries stall out during the backups on 8.4, so far as I am aware.
>
>The time it takes for pg_dump to run has grown from 1 hour to 3 and even 4 hours over the last 6 months, with more than half of that increase occurring since we upgrade to 9.0.x. In the >last several weeks (possibly since the upgrade to 9.0.4), we are seeing all connections getting used up (our main apps use connection pools, but monitoring and some utilities are making >direct connections for each query, and some of them don't check for the prior query to complete before sending another, which slowly eats up available connections). Even the connection >pool apps cease functioning during the backup, however, as all of the connections wind up in parse waiting state. I also see lots of sockets in close wait state for what seems to be an >indefinite period while the backup is running and all connections are used up. I assume all of this is the result of pg_dump starting a transaction or otherwise blocking other access. I >can get everything using a pool, that's not a huge problem to solve, but that won't fix the fundamental problem of no queries being able to finish while the backup is happening.
There are definitely no bloated tables. The large tables are all insert-only, and old data is aggregated up and then removed by dropping whole partitions. There should be no bloat whatsoever. The OLTP side of things is pretty minimal, and I can pg_dump those schemas in seconds, so they aren't the problem, either. I don't know what the I/O utilization is during the dump, offhand. I'll be doing a more thorough investigation tonight, though I suppose I could go look at the monitoring graphs if I weren't in the middle of 6 other things at the moment. the joys of startup life.