The time it takes pg_dump to run has grown from 1 hour to 3 and even 4 hours over the last 6 months, with more than half of that increase occurring since we upgraded to 9.0.x. In the last several weeks (possibly since the upgrade to 9.0.4), we are seeing all connections get used up during the backup. Our main apps use connection pools, but monitoring and some utilities make a direct connection for each query, and some of them don't wait for the prior query to complete before sending another, which slowly eats up the available connections. Even the pooled apps stop functioning during the backup, however, as all of their connections wind up stuck in the "parse waiting" state. I also see lots of sockets sitting in CLOSE_WAIT, apparently indefinitely, while the backup is running and all connections are used up.

I assume all of this is the result of pg_dump starting a transaction or otherwise blocking other access. I can move everything onto a pool; that isn't a hard problem to solve, but it won't fix the fundamental problem of no queries being able to finish while the backup is happening.
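For reference, a query along the lines of the one below should show which backend is blocking which while the dump runs. This is only a rough sketch, and the column names are the 9.0 ones (procpid and current_query):

    -- Rough sketch: list ungranted lock requests alongside the backend
    -- that currently holds a conflicting lock on the same relation.
    SELECT waiting.pid        AS waiting_pid,
           wa.current_query   AS waiting_query,
           waiting.mode       AS requested_mode,
           holder.pid         AS holding_pid,
           ha.current_query   AS holding_query,
           holder.mode        AS held_mode
    FROM pg_locks waiting
    JOIN pg_stat_activity wa ON wa.procpid = waiting.pid
    JOIN pg_locks holder     ON holder.locktype = waiting.locktype
                            AND holder.database IS NOT DISTINCT FROM waiting.database
                            AND holder.relation IS NOT DISTINCT FROM waiting.relation
                            AND holder.granted
                            AND holder.pid <> waiting.pid
    JOIN pg_stat_activity ha ON ha.procpid = holder.pid
    WHERE NOT waiting.granted;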
I know I'm not the only one running a database of this size. How do others handle backups? At the moment I don't have replication set up, but I could use the old hardware as a replica. It doesn't have quite the I/O capacity and nowhere near as much RAM, but I wouldn't be looking to use it for queries unless I lost the primary, and it is definitely capable of handling the insert load, at least when the inserts are done directly; I'm not sure whether handling the same inserts via streaming replication is easier or harder. My question is: what are the performance repercussions of running a pg_dump backup off the replicated server? If it experiences the same kind of lockup, will SR fall so far behind that it can't catch up? Is there some other preferred way to back up a large database?
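For context, my (possibly wrong) understanding of the 9.0 knobs involved is roughly the following; the settings below are a sketch of what I think a dump-friendly standby would need, not a tested configuration:

    # postgresql.conf on the standby -- sketch, not a tested configuration
    hot_standby = on                    # allow read-only connections, so pg_dump can run here
    max_standby_streaming_delay = -1    # let a long dump pause WAL replay instead of being
                                        # cancelled; the standby falls behind and catches up later
    max_standby_archive_delay = -1      # same, for WAL restored from the archive

    # postgresql.conf on the primary
    wal_level = hot_standby             # required for a hot standby
    max_wal_senders = 1                 # at least one slot for the streaming connection

If that understanding is right, the trade-off is that the dump never gets cancelled, but replay on the standby stalls for as long as the dump takes, which is exactly my "can SR catch up" question above.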
And finally, is the lockout I'm experiencing actually the result of a bug, or am I misusing pg_dump in some way?