On Mon, Mar 16, 2009 at 2:50 PM, Joe Uhl <joeuhl@xxxxxxxxx> wrote:
> I dropped the pool sizes and brought things back up. Things are stable,
> site is fast, CPU utilization is still high. Probably just a matter of
> time before the issue comes back (we get slammed as kids get out of
> school in the US).

Yeah, I'm guessing your server (or more specifically its RAID card) just
isn't up to the task. We had the same problem last year with a machine
with 16 Gig of RAM and dual dual-core 3.0GHz Xeons with a Perc 5
something or other. No matter how we tuned it or played with it, we just
couldn't get good random performance out of it. It's since been replaced
by a white-box unit with a Tyan mobo, dual 4-core Opterons, an Areca
1680, and a 12-drive RAID-10. We can sustain 30 to 60 Megs a second of
random access with 0 to 10% iowait.

Here's a typical vmstat 10 output when our load factor is hovering
around 8...

 r  b  swpd    free   buff     cache  si  so    bi     bo    in     cs  us sy id wa st
 4  1   460  170812  92856  29928156   0   0   604   3986  4863  10146  74  3 20  3  0
 7  1   460  124160  92912  29939660   0   0   812   5701  4829   9733  70  3 23  3  0
13  0   460  211036  92984  29947636   0   0   589   3178  4429   9964  69  3 25  3  0
 7  2   460   90968  93068  29963368   0   0  1067   4463  4915  11081  78  3 14  5  0
 7  3   460  115216  93100  29963336   0   0  3008   3197  4032  11812  69  4 15 12  0
 6  1   460  142120  93088  29923736   0   0  1112   6390  4991  11023  75  4 15  6  0
 6  0   460  157896  93208  29932576   0   0   698   2196  4151   8877  71  2 23  3  0
11  0   460  124868  93296  29948824   0   0   963   3645  4891  10382  74  3 19  4  0
 5  3   460   95960  93272  29918064   0   0   592  30055  5550   7430  56  3 18 23  0
 9  0   460   95408  93196  29914556   0   0  1090   3522  4463  10421  71  3 21  5  0
 9  0   460  128632  93176  29916412   0   0   883   4774  4757  10378  76  4 17  3  0

Note the bursty parts where we're shoving out 30 Megs a second and the
wait jumps to 23%. That's about as bad as it gets during the day for us.

Note that in your graph the bi column appears to be dominating the bo
column, so it looks like you're reaching a point where the write cache
on the controller fills up: your real outbound throughput drops to
roughly 1 megabyte a second, and the inbound traffic either has priority
or is just filling in the gaps. It looks to me like your RAID card is
prioritizing reads over writes, and the whole system is slowing to a
crawl. I'm willing to bet that if you were running pure SW RAID with no
RAID controller you'd get better numbers.
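
In case it helps you catch this in the act, here's a rough Python
sketch that tails vmstat 10 and flags samples where block-out or iowait
spikes, like the 30055 / 23% row above. The column indices assume the
standard Linux vmstat layout shown above, and the thresholds are
arbitrary, so tune them to your box:

    import subprocess

    # Watch `vmstat 10` and flag samples where block-out (bo) or
    # iowait (wa) bursts. In the standard Linux vmstat layout above,
    # bo is the 10th field and wa the 16th (0-indexed 9 and 15).
    # Thresholds are arbitrary; adjust for your workload.
    BO_LIMIT, WA_LIMIT = 20000, 15

    proc = subprocess.Popen(["vmstat", "10"], stdout=subprocess.PIPE,
                            text=True)
    for line in proc.stdout:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the two header lines vmstat prints
        bo, wa = int(fields[9]), int(fields[15])
        if bo > BO_LIMIT or wa > WA_LIMIT:
            print("burst: bo=%d blocks/s, wa=%d%%" % (bo, wa))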
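
And if you want to poke at the write-cache theory directly, a crude
probe like the one below (file name, sizes, and counts are all made up,
so treat it as a sketch rather than a proper benchmark) does fsync'd
random 8K writes and reports the sustained rate. Run it long enough that
the total written exceeds the controller's cache, then compare the
steady-state number on your current setup against a plain SW RAID
volume:

    import os, random, time

    # Crude random-write probe: 8 KB writes at random offsets in a
    # 1 GB file, fsync'd one at a time so the OS page cache can't
    # absorb them. 50000 writes is ~400 MB, enough to overflow a
    # typical controller write cache and show the disks' real rate.
    PATH, SIZE, BLOCK, COUNT = "probe.dat", 1 << 30, 8192, 50000

    fd = os.open(PATH, os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, SIZE)
    buf = os.urandom(BLOCK)
    start = time.time()
    for _ in range(COUNT):
        os.lseek(fd, random.randrange(SIZE // BLOCK) * BLOCK,
                 os.SEEK_SET)
        os.write(fd, buf)
        os.fsync(fd)
    elapsed = time.time() - start
    os.close(fd)
    print("%.0f writes/sec, %.2f MB/s" %
          (COUNT / elapsed, COUNT * BLOCK / elapsed / 1e6))

Keep in mind that with a battery-backed write-back cache the early
fsyncs return as soon as data hits the controller's RAM, so only the
long-run rate tells you anything about the spindles.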