On Sat, Feb 9, 2013 at 2:37 PM, Josh Krupka <jkrupka@xxxxxxxxx> wrote:
> Johnny,
> Sure thing, here's the SystemTap script:
Thank you for this!
> - I think you already started looking at this, but the linux dirty memory settings may have to be tuned as well (see Greg's post http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html). Ours haven't been changed from the defaults, but that's another thing to test for next week. Have you had any luck tuning these yet?
We lowered dirty_background_bytes to 1/4 of dirty_bytes. That didn't get rid of the spikes, but it seemed to have some impact: over 24 hours of observation, the spikes clustered together more closely, with long stretches in between without any spikes. Unfortunately, we only got those 24 hours of observation before making the next change.
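For anyone following along, here's roughly what that change looks like. The ratio (1/4) is what we actually used; the absolute dirty_bytes value below is just an example, not our production number:

```shell
# Example: with dirty_bytes at 1 GiB, set dirty_background_bytes to 1/4 of it.
dirty_bytes=1073741824
dirty_background_bytes=$(( dirty_bytes / 4 ))
echo "vm.dirty_background_bytes = $dirty_background_bytes"

# As root, apply it at runtime (and add to /etc/sysctl.conf to persist):
#   sysctl -w vm.dirty_background_bytes=268435456
```

Note that once dirty_background_bytes is nonzero, the kernel ignores dirty_background_ratio (and likewise dirty_bytes vs. dirty_ratio), so you only tune one of each pair.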
The next change was lowering the pgbouncer pool size down to 50. We originally (way back when) started out at 100, then bumped it up to 150. But Jeff Janes' rationale for LOWERING the pool size / connection count made sense to us.
And so far, 48 hours since lowering it, it does seem to have eliminated the DB spikes! We haven't seen one yet, and this is the longest we've gone without seeing one.
To be more precise, we now have a lot more "short" spikes -- i.e., our response time graphs are more jagged, but at least the peaks are under the threshold we desire. Previously, they were "smooth" in between the spikes.
We will probably tweak this knob some more -- i.e., what is the sweet spot between 1 and 100? Would it be higher than 50 but less than 100? Or is it somewhere lower than 50?
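For reference, the knob in question is pgbouncer's default_pool_size (or a per-database pool_size override). A sketch of the change, with illustrative values and an assumed admin-console port/user -- not our exact config:

```shell
# Illustrative pgbouncer.ini fragment (values are examples):
#   [pgbouncer]
#   default_pool_size = 50
#
# After editing the ini, tell pgbouncer to pick up the change via its
# admin console (port 6432 and the "pgbouncer" admin user are assumptions
# about the setup):
psql -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
```

Reloading this way lets you binary-search the sweet spot without dropping client connections.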
Even after we find that sweet spot, I'm still going to try some of the other suggestions. I do want to play with shared_buffers, just so we know whether larger or smaller values are better for our setup. I'd also like to test the THP (transparent huge pages) settings on a testing cluster, which we are still in the middle of setting up (or rather, we have set up, but we need to make it more prod-like).
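In case it's useful to anyone else planning a THP test: the usual way to disable it for a trial run is via sysfs. The paths below are for mainline kernels; adjust for your distro (e.g. older RHEL 6 kernels use /sys/kernel/mm/redhat_transparent_hugepage):

```shell
# Disable THP and its defrag pass until the next reboot (run as root).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify -- the active setting is shown in brackets, e.g. "always madvise [never]":
cat /sys/kernel/mm/transparent_hugepage/enabled
```

Since this doesn't survive a reboot, it's handy for A/B testing on the prod-like cluster before committing to a boot-time setting.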
johnny