On Tue, Oct 15, 2013 at 10:26 AM, Julien Cigar <jcigar@xxxxxxxxx> wrote:
for two reasons:
- some of the overhead of bgwriter and checkpoints is more or less linear
in the size of shared_buffers, for example it could be possible that a
large quantity of data could be dirty when a checkpoint occurs).
- the OS cache is also being used for reads and writes, the larger
shared_buffers is, the more you risk double buffering (same blocks
in the OS cache and in the database buffer cache).
We do streaming replication and also store them for snapshot PITR backups, so I am intimately familiar with our write load, and I can say it is pretty low (we ship a 16MB WAL file about every 10-15 minutes during our busiest times).
That said, I can see how an import that is doing a bunch of writes could possibly spread those over a large area that would then consume a lot of CPU on the writer and checkpoint; however, I do not see how either of those would cause 40-60 different postgres backgroud processes (all running a normally "light query") to spin off into oblivion unless the write work load is somehow threaded into the background workers (which I'm sure it isn't). So, I think we're still dealing with a spinlock issue.
We're going to upgrade to 9.1.10 (with debug symbols) Thursday night and add another 64GB of RAM. I'll tune shared_buffers down to 2GB at that time and bump effective_cache_size up at the same time. My large history of sar data will make it apparent pretty quickly if that is a win/lose/tie.
If we have another spike in production, we'll be ready to measure it more accurately.
Thanks,
Tony