On Mar 20, 2009, at 4:29 PM, Scott Marlowe wrote:
On Fri, Mar 20, 2009 at 2:26 PM, Joe Uhl <joeuhl@xxxxxxxxx> wrote:
On Mar 17, 2009, at 12:19 AM, Greg Smith wrote:
On Tue, 17 Mar 2009, Gregory Stark wrote:
Hm, well the tests I ran for posix_fadvise were actually on a Perc5 -- though who knows if it was the same under the hood -- and I saw better performance than this. I saw about 4MB/s for a single drive and up to about 35MB/s for 15 drives. However, this was using Linux md RAID-0, not hardware RAID.
Right, it's the hardware RAID on the Perc5 that I think people mainly complain about. If you use it in JBOD mode and let the higher-performance CPU in your main system drive the RAID functions, it's not so bad.
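For context, a minimal sketch of that approach (device names and drive count are assumptions for illustration, not from this thread) is to export the drives as JBOD and stripe them with Linux md:

# Sketch only: build a software RAID-0 array from drives the controller
# exposes as plain disks; adjust device names and --raid-devices to match
# your system.
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

Gregory's numbers above were from a RAID-0 test; for production data you would more likely choose --level=10 to keep some redundancy.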
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD
I have not yet had a chance to try software RAID on the standby server (still planning to), but wanted to follow up to see whether there is a good way to figure out what the PostgreSQL processes are spending their CPU time on.

We are under peak load right now, and I have Zabbix plotting CPU utilization and CPU wait (from vmstat output) along with all sorts of other vitals on charts. CPU utilization is a sustained 90-95% and CPU wait is holding below 10%. Since being pointed at vmstat by this list I have been watching CPU wait, and it does get high at times (hence still wanting to try the Perc5 in JBOD), but there are also sustained periods, right now included, where our CPUs are getting crushed while wait and I/O (only about 1.5 MB/s right now) are very low.
This high CPU utilization occurs only when we are under peak load and our JDBC pools are fully loaded. We are moving more things into our cache and constantly tuning indexes and tables, but want to see if there is some underlying cause that is killing us.
Any recommendations for figuring out what our database is spending its CPU time on?
What does the cs entry in vmstat say at this time? If your cs is skyrocketing, then you're getting a context switch storm, which is usually a sign that there are just too many things going on at once, that you've got an old kernel, or something along those lines.
The cs column (plus the cpu columns) of vmstat 1 30 reads as follows:
cs us sy id wa
11172 95 4 1 0
12498 94 5 1 0
14121 91 7 1 1
11310 90 7 1 1
12918 92 6 1 1
10613 93 6 1 1
9382 94 4 1 1
14023 89 8 2 1
10138 92 6 1 1
11932 94 4 1 1
15948 93 5 2 1
12919 92 5 3 1
10879 93 4 2 1
14014 94 5 1 1
9083 92 6 2 0
11178 94 4 2 0
10717 94 5 1 0
9279 97 2 1 0
12673 94 5 1 0
8058 82 17 1 1
8150 94 5 1 1
11334 93 6 0 0
13884 91 8 1 0
10159 92 7 0 0
9382 96 4 0 0
11450 95 4 1 0
11947 96 3 1 0
8616 95 4 1 0
10717 95 3 1 0
We are running the 2.6.28.7-2 kernel. I am unfamiliar with vmstat output, but reading the man page (and seeing that cs = "context switches per second") makes my numbers seem very high.
Our JDBC pools currently sum to a maximum of 400 connections (and we are doing work on all 400 right now). I may try dropping those pools down even further. Are there any general rules of thumb for figuring out how many connections you should service at maximum? I know about the memory constraints, but I am thinking more along the lines of connections per CPU core.
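For scale, one commonly cited heuristic (an outside rule of thumb, not something from this thread, so treat it as a starting assumption to benchmark) puts the total pool ceiling near (2 x CPU cores) + effective spindle count:

# Back-of-the-envelope sketch; SPINDLES is an assumption you would set
# to the number of active data drives in your array.
CORES=$(grep -c '^processor' /proc/cpuinfo)
SPINDLES=8
echo "suggested starting pool ceiling: $(( 2 * CORES + SPINDLES ))"

From there, the usual advice is to adjust the JDBC pool maximums up or down while watching actual throughput, rather than sizing them to the number of concurrent users.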