Thanks Mark,
I had a look at the iostat output (on a 5s interval) and pasted it below. Utilization and waits seem low. Sample 1 was taken during normal operation; when the locks happen, everything basically drops to 0 across the board (Sample 2). My (mis)understanding of the IOPS was that each volume would deliver 1,000 IOPS, and that RAID0 should give me quite a bit higher throughput than a single EBS volume setup. (My naive back-of-the-envelope calculation was #volumes * PIOPS = effective IOPS :/)
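Spelling that envelope calculation out with the numbers from this setup (six data volumes, and borrowing the 8k-buffer figure from Mark's mail below):

    6 volumes * 1,000 PIOPS         =  6,000 IOPS theoretical
    6,000 IOPS * 8 KB per request  ~=  48 MB/s of 8k-buffer traffic

    Sample 1 below shows md127 at only ~830 r/s (roughly 14 MB/s),
    so nowhere near that theoretical ceiling.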
Sample 1 (normal operation):

Device:         rrqm/s  wrqm/s     r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
xvda              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdk              0.00    0.00  141.60   0.00  5084.80     0.00    35.91     0.43   3.06   0.51   7.28
xvdj              0.00    0.00  140.40   0.40  4614.40    24.00    32.94     0.49   3.45   0.52   7.28
xvdi              0.00    0.00  123.00   2.00  4019.20   163.20    33.46     0.33   2.63   0.68   8.48
xvdh              0.00    0.00  139.80   0.80  4787.20    67.20    34.53     0.52   3.73   0.55   7.68
xvdg              0.00    0.00  143.80   0.20  4804.80    16.00    33.48     0.86   6.03   0.72  10.40
xvdf              0.00    0.00  146.40   0.00  4758.40     0.00    32.50     0.55   3.76   0.55   8.00
md127             0.00    0.00  831.20   3.40 27867.20   270.40    33.71     0.00   0.00   0.00   0.00
Sample 2 (during a lock-up):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00  100.00    0.00    0.00    0.00

Device:         rrqm/s  wrqm/s     r/s    w/s   rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
xvda              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdk              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdj              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdi              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdh              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdg              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
xvdf              0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
md127             0.00    0.00    0.00   0.00     0.00     0.00     0.00     0.00   0.00   0.00   0.00
Utilization only spikes to 100% when the server restarts. What bugs me, though, is that the CloudWatch metrics show 100% throughput on all the volumes despite the iostat output above.
I'm looking into the vm.dirty_background_ratio and vm.dirty_ratio sysctls. Is there any guidance, or are there any links, that would be a useful starting point?
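For reference, this is the sort of change I'm contemplating - just a sketch, and the specific values are placeholders I picked rather than tested recommendations:

    # Start background writeback earlier and cap dirty memory lower, so
    # the kernel never has to flush many gigs of dirty pages at once on
    # a 60G box. (Values illustrative only.)
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10

    # To persist across reboots, the same settings would go in
    # /etc/sysctl.conf:
    #   vm.dirty_background_ratio = 5
    #   vm.dirty_ratio = 10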
Thanks again for the help; I really appreciate it.
Regards,
Armand
On Tue, Apr 2, 2013 at 2:11 AM, Mark Kirkwood <mark.kirkwood@xxxxxxxxxxxxxxx> wrote:
In addition to tuning the various Postgres config knobs, you may need to look at how your AWS server is set up. If your load is causing an IO stall, then one of the *symptoms* will be lots of locks...
You have quite a lot of memory (60G), so look at tuning the vm.dirty_background_ratio and vm.dirty_ratio sysctls to avoid trying to *suddenly* write out many gigs of dirty buffers.
Your provisioned volumes are much better than the default AWS ones, but are still not hugely fast (i.e. 1,000 IOPS is about 8 MB/s worth of Postgres 8k buffers). So you may need to look at adding more volumes to the array, or adding some separate ones and putting the pg_xlog directory on 'em.
However, before making changes I would recommend using iostat or sar to monitor how the volumes are handling the load (I usually choose a 1 sec granularity and look for 100% util and high (several hundred ms) awaits). Also iotop could be enlightening.
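For example, something like the following (exact flags from memory, so do check the man pages):

    iostat -x 1     # extended per-device stats at 1s intervals
    sar -d 1        # per-device activity, also at 1s
    iotop -o        # per-process IO, only processes actually doing IO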
Regards
Mark