On Wed, Jan 26, 2011 at 10:16 AM, Shaun Thomas <sthomas@xxxxxxxxx> wrote:
> Worse however, is your checkpoints. Lord. Increase checkpoint_segments to
> *at least* 20, and increase your checkpoint_completion_target to 0.7 or 0.8.
> Check your logs for checkpoint warnings, and I'll bet it's constantly
> complaining about increasing your checkpoint segments. Every checkpoint not
> started by the scheduled system risks a checkpoint spike, which can flood
> your system with IO regardless of which queries are running. That kind of IO
> storm will ruin your performance, and with only 3 checkpoint segments on a
> busy database, are probably happening constantly.

To Shaun: Unless she's not write bound but read bound. We can't tell, because we haven't seen the queries, and we haven't seen the output of iostat or vmstat.

To Anne: Another tool to recommend is sar. It's part of the sysstat package on debian / ubuntu / rhel. You have to enable it first (it'll tell you how when you try to run it right after installing it). It lets you look back over the last 7 days, 5 minutes at a time, to see the trends on your servers. Very useful stuff, and easy to graph in a spreadsheet or web page. Or just read it.

For instance, here's the output of sar on the data drive of a slave (read only) server under slony replication:

sar -d -f sa25 | grep "02:[01234].:.. AM.\+dev251-1"

Linux 2.6.32-27-server ()    01/25/2011    _x86_64_    (16 CPU)

12:00:01 AM       DEV       tps   rd_sec/s   wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
09:45:01 AM  dev251-1    481.21    6981.74    1745.82     18.14      4.86     10.10      1.57     75.65
09:55:01 AM  dev251-1    620.16   28539.52    2135.67     49.46      5.25      8.47      1.22     75.42
10:05:01 AM  dev251-1   1497.16   29283.95    1898.94     20.83     13.89      9.28      0.64     96.52
10:15:01 AM  dev251-1   1040.47   17123.89    2286.10     18.66      8.89      8.54      0.87     90.49
10:25:01 AM  dev251-1    562.97    8802.77    1515.50     18.33      4.84      8.60      1.41     79.57

Let me interpret for ya, in case it's not obvious. IO utilization runs from about 75% up to over 95%. When it's pushing 90% and above we're running 1,000 to 1,500 tps, reading at a maximum of about 15MB a second (rd_sec/s is 512-byte sectors) and writing a paltry 1MB or so a second. Average wait stays around 10ms.

If we use sar -u from the same time period, we can match %iowait up to this chart and see if we were really waiting on IO or not:

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:45:01 AM     all     47.44      0.00      5.20      4.94      0.00     42.42
09:55:01 AM     all     46.42      0.00      5.63      5.77      0.00     42.18
10:05:01 AM     all     48.64      0.00      6.35     11.87      0.00     33.15
10:15:01 AM     all     46.94      0.00      5.79      8.81      0.00     38.46
10:25:01 AM     all     48.68      0.00      5.58      5.42      0.00     40.32

We can see that at peak about 11% of our CPU power is waiting behind IO. We have 16 CPUs, so each one is 100/16, or 6.25%, of the total. So at 11% we've basically got the equivalent of two cores on hold the whole time. In real life on this machine ALL the cpus are waiting about 11% of the time across the board, but the math comes out the same: we're waiting on IO.
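In case sar isn't collecting yet, here's a rough sketch of turning it on. The package name is the same everywhere, but the paths and commands below are the Debian/Ubuntu flavor (RHEL keeps its daily files under /var/log/sa and uses the sysstat service instead), so adjust to taste:

sudo apt-get install sysstat
# Debian/Ubuntu ships with collection turned off; flip the switch and restart:
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo /etc/init.d/sysstat restart
# From then on cron drops one saNN file per day into /var/log/sysstat
# (sa25 above is the 25th), which is what sar -f reads back.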
Here's a heavy db server, lots of ram, same time period. sdh is one of a large number of disks in a RAID-10 array; md17 is that RAID-10 array (actually a RAID-0 on top of a bunch of RAID-1s, since I still don't trust linux's RAID-10 implementation).

12:00:01 AM       DEV       tps   rd_sec/s   wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
09:45:01          sdh      5.86       5.65     158.87     28.08      0.21     35.07      3.36      1.97
09:45:01         md17    253.78     168.69    1987.64      8.50      0.00      0.00      0.00      0.00
09:55:01          sdh      5.48       5.65     134.99     25.68      0.16     30.00      3.31      1.81
09:55:01         md17    215.13     157.56    1679.67      8.54      0.00      0.00      0.00      0.00
10:05:01          sdh      4.37       5.39     106.53     25.58      0.09     21.61      3.57      1.56
10:05:01         md17    170.81     133.76    1334.43      8.60      0.00      0.00      0.00      0.00
10:15:01          sdh      6.16       5.37     177.25     29.64      0.25     40.95      3.38      2.08
10:15:01         md17    280.63     137.88    2206.95      8.36      0.00      0.00      0.00      0.00
10:25:01          sdh      4.52       3.72     116.41     26.56      0.09     20.64      3.58      1.62
10:25:01         md17    187.65     107.59    1470.88      8.41      0.00      0.00      0.00      0.00

(Note that md devices do not show values for %util, svctm or await, hence the need to look at sdh as well.)

This db fits the data set in ram; the other machine doesn't. It had a caching RAID controller, but that caught fire and burned down. The next one caught fire, burned down, and fell into the swamp. It now has a straight-up SAS controller with no caching. Numbers were even better when it had a caching RAID controller, but I got tired of replacing them.

OK, so md17 is handling only around 280 tps at most, while the array on the other, smaller server was doing around 1,000. The master is reading 5 or 6 sectors per second per spindle, and only 100 to 170 across the whole array, while the slave is reading up to 30k sectors a second. The master is writing at ~1,500 to 2,000 sectors a second; the slave is similar.

The slave server above was badly IO bound because it A: didn't have enough memory to cache the data set, and B: hadn't had time to warm up and put what memory it did have to work. It was thrown into the mix mid morning rush and fell flat on its ass. If it had been warmed up first (running it at a 100:1 load factor in our load balancing module to start) it would have been ok. It would still have had horrible IO numbers, though. Once the caches had loaded, by 1500hrs the slave was reading at just 500 sectors/sec, child's play really.

So get sar running, and get some numbers from the machine when these things are happening. Look at how things look before, during, and after the crisis.
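If it helps, here's a rough sketch of what pulling that before/during/after window can look like; sa25 and the 09:00 to 11:00 range are just placeholders for your own problem day and hours (Debian/Ubuntu keeps the daily files under /var/log/sysstat, RHEL under /var/log/sa):

# Disk activity for a window around the incident (swap in your own device name):
sar -d -f /var/log/sysstat/sa25 -s 09:00:00 -e 11:00:00 | grep dev251-1

# CPU and %iowait for the same window, to see whether you were actually stuck on IO:
sar -u -f /var/log/sysstat/sa25 -s 09:00:00 -e 11:00:00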