Re: High CPU Utilization

I dropped the pool sizes and brought things back up. Things are stable, the site is fast, and CPU utilization is still high. It's probably just a matter of time before the issue comes back (we get slammed as kids get out of school in the US).

Now when I run vmstat 1 30 it looks very different (below). Waiting is minimal and user time is very high. Under nontrivial load, according to xact_commit in pg_stat_database, we are doing 1800+ tps.
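
For reference, the tps figure comes from sampling xact_commit twice and dividing the delta by the interval. A minimal shell sketch, assuming psql connects to the database in question; the 60-second window is arbitrary:

c1=$(psql -At -c "SELECT xact_commit FROM pg_stat_database WHERE datname = current_database()")
sleep 60
c2=$(psql -At -c "SELECT xact_commit FROM pg_stat_database WHERE datname = current_database()")
echo "commits/sec: $(( (c2 - c1) / 60 ))"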

Appreciate the input and explanation on vmstat. I am going to throw some of these numbers into Zabbix so I can keep a better eye on them. This server is a couple of years old, so the purchase of a new controller and/or disks is not out of the question.
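
Something like the following is what I have in mind for the Zabbix agent (a sketch only; UserParameter is the standard zabbix_agentd.conf hook, and the awk column numbers assume the 16-column vmstat layout below, which can shift between procps versions). "vmstat 1 2 | tail -1" is used because the first vmstat sample only reports averages since boot:

UserParameter=vmstat.cpu.user,vmstat 1 2 | tail -1 | awk '{print $13}'
UserParameter=vmstat.cpu.iowait,vmstat 1 2 | tail -1 | awk '{print $16}'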

One final note: I have made several changes to postgresql.conf. Some of those are here:
max_connections = 1000
shared_buffers = 7680MB
work_mem = 30MB
synchronous_commit = off
checkpoint_segments = 50
effective_cache_size = 20000MB

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd   free  buff    cache  si  so    bi    bo    in    cs us sy id wa
 9  8 73036 500164 82200 23497748   3   4   669   541     1     1 23  3 54 19
20  4 73036 497252 82200 23500836   0   0  2500   680 11145 15168 91  4  2  2
21  1 73036 491416 82204 23503832   0   0  1916   920 10303 14032 94  4  1  1
23  5 73036 489580 82212 23505860   0   0  1348  3296 11682 15970 94  5  1  0
31  1 73036 481408 82220 23507752   0   0   984  8988 10123 11289 97  3  0  0
25  4 73036 483248 82232 23509420   0   0  1268  1312 10705 14063 96  4  0  0
23  4 73036 480096 82232 23512380   0   0  2372   472  9805 13996 94  5  1  1
24  4 73036 476732 82236 23515196   0   0  2012   720 10365 14307 96  3  1  0
22  1 73036 474468 82236 23516584   0   0   944  3108  9838 12831 95  4  1  0
14  1 73036 455756 82284 23534548   0   0   908  3284  9096 11333 94  4  1  0
10  2 73036 455224 82292 23536304   0   0  1760   416 12454 17736 89  6  3  2
17  0 73036 460620 82292 23538888   0   0  1292   968 12030 18333 90  7  2  1
13  4 73036 459764 82292 23539724   0   0   332   288  9722 14197 92  5  2  1
17  5 73036 457516 82292 23542176   0   0  1872 17752 10458 15465 91  5  2  1
19  4 73036 450804 82300 23545640   0   0  2980   640 10602 15621 90  6  2  2
24  0 73036 447660 82312 23547644   0   0  1736 10724 12401 15413 93  6  1  0
20  6 73036 444380 82320 23550692   0   0  2064   476  9008 10985 94  4  1  0
22  2 73036 442880 82328 23553640   0   0  2496  3156 10739 15211 93  5  1  1
11  1 73036 441448 82328 23555632   0   0  1452  3552 10812 15337 93  5  2  1
 6  2 73036 439812 82348 23557420   0   0  1052  1128  8603 10514 91  3  3  2
 6  3 73036 433456 82348 23560860   0   0  2484   656  7636 13033 68  4 14 14
 6  3 73036 433084 82348 23562628   0   0  1400   408  6046 11778 70  3 18  9
 5  0 73036 430776 82356 23564264   0   0  1108  1300  7549 13754 73  4 16  7
 5  2 73036 430124 82360 23565580   0   0  1016  2216  7844 14507 72  4 18  6
 4  2 73036 429652 82380 23567480   0   0  1168  2468  7694 15466 58  4 24 14
 6  2 73036 427304 82384 23569668   0   0  1132   752  5993 13606 49  5 36 10
 7  1 73036 423020 82384 23571932   0   0  1244   824  8085 18072 56  3 30 10
 4  0 73036 420816 82392 23573824   0   0  1292   820  5370 10958 46  2 41 10
 9  1 73020 418048 82392 23576900  52   0  1632  2592  5931 11629 60  3 29  8
 4  2 73004 415164 82424 23578620  56   0  1812  4116  7503 14674 71  3 15 12

On Mar 16, 2009, at 4:19 PM, Dave Youatt wrote:
The last column, "wa", is the percentage of CPU time spent waiting (for IO to
complete). Values in the 80s and 90s are pretty high, probably too high.

You might also want to measure the random IO/s performance of your RAID
controller. From the descriptions, that will be much more important than
long sequential reads/writes for characterizing your workload.
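
For example, fio can report random-read IOPS directly. A rough sketch, assuming fio is installed; the file path, size, and job count are placeholders, and bs=8k matches the PostgreSQL page size:

fio --name=randread --filename=/tmp/fio.testfile --size=2g \
    --rw=randread --bs=8k --direct=1 --ioengine=libaio \
    --runtime=60 --time_based --numjobs=4 --group_reporting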

There are also some disappointing HW RAID controllers out there. Generally,
Areca and Promise are good, Adaptec is good depending on the model, and the
ones Dell ships with their servers haven't had good reviews/reports.


On 03/16/2009 01:10 PM, Joe Uhl wrote:
Here is vmstat 1 30. We are under peak load right now so I can gather
information from the real deal :)

Had an almost complete lockup a moment ago; the number of non-idle
postgres connections was 637. Going to drop our JDBC pool sizes a bit
and bounce everything.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd     free  buff    cache  si  so    bi    bo    in    cs us sy id wa
12 35 95056 11102380 56856 14954948   3   4   669   541     1     2 23  3 54 19
12 39 95056 11092484 56876 14963204   0   0  6740  1204 10066 13277 91  5  0  4
 8 42 95056 11081712 56888 14972244   0   0  8620  1168 10659 17020 78  6  0 15
10 30 95052 11069768 56904 14982628   0   0  8944   976  9809 15109 81  6  1 12
 4 27 95048 11059576 56916 14991296   0   0  8852   440  7652 13294 63  4  2 32
 5 42 95048 11052524 56932 14996496   0   0  4700   384  6383 11249 64  4  4 28
 5 33 95048 11047492 56956 15001428   0   0  3852   572  6029 14010 36  4  5 56
 7 35 95048 11041184 56960 15005480   0   0  3964   136  5042 10802 40  3  1 56
 1 33 95048 11037988 56968 15009240   0   0  3892   168  3384  6479 26  1  3 69
 3 28 95048 11029332 56980 15015744   0   0  6724   152  4964 12844 11  2  8 79
 0 34 95048 11025880 56988 15020168   0   0  3852   160  3616  8614 11  1  6 82
 3 25 95048 10996356 57044 15044796   0   0  7892   456  3126  7115  4  3  8 85
 1 26 95048 10991692 57052 15050100   0   0  5188   176  2566  5976  3  2 12 83
 0 29 95048 10985408 57060 15054968   0   0  4200    80  2586  6582  4  1 12 83
 1 29 95048 10980828 57064 15058992   0   0  4560    64  2966  7557  7  2  6 85
 2 28 95048 10977192 57072 15063176   0   0  3860    72  2695  6742 11  1  7 81
 2 29 95048 10969120 57088 15067808   0   0  5084    84  3296  8067 14  1  0 84
 0 25 95048 10962096 57104 15072984   0   0  4440   500  2721  6263 12  1  6 80
 0 23 95044 10955320 57108 15079260   0   0  5712   232  2678  5990  6  1  6 87
 2 25 95044 10948644 57120 15084524   0   0  5120   184  3499  8143 20  3  9 69
 3 21 95044 10939744 57128 15090644   0   0  5756   264  4724 10272 32  3  5 60
 1 19 95040 10933196 57144 15095024  12   0  4440   180  2585  5244 13  2 15 70
 0 21 95040 10927596 57148 15098684   0   0  3248   136  2973  7292  8  1  9 81
 1 20 95040 10920708 57164 15104244   0   0  5192   360  1865  4547  3  1  9 87
 1 24 95040 10914552 57172 15105856   0   0  2308    16  1948  4450  6  1  1 93
 0 24 95036 10909148 57176 15110240   0   0  3824   152  1330  2632  3  1  6 90
 1 21 95036 10900628 57192 15116332   0   0  5680   180  1898  3986  4  1 11 84
 0 19 95036 10888356 57200 15121736   0   0  5952   120  2252  3991 12  1  8 79
 2 22 95036 10874336 57204 15128252   0   0  6320   112  2831  6755  5  2  8 85
 3 26 95036 10857592 57220 15134020   0   0  5124   216  3067  5296 32  6  3 59

Alan, my apologies if you get this twice. I didn't reply back to the
list on the first try.

On Mar 16, 2009, at 3:52 PM, Alan Hodgson wrote:

On Monday 16 March 2009, Joe Uhl <joeuhl@xxxxxxxxx> wrote:
Right now (not under peak load) this server is running at 68% CPU
utilization and its SATA RAID 10 is doing about 2MB/s writes and 11MB/s
reads. When I run dd I can hit 200+MB/s writes and 230+MB/s reads,
so we are barely using the available IO. Further, when I run dd the
CPU utilization of that process only approaches 20%-30% of one core.

What does vmstat say when it's slow? The output of "vmstat 1 30" would be
informative.

note: dd is sequential I/O. Normal database usage is random I/O.

--
Even a sixth-grader can figure out that you can’t borrow money to pay off your debt





On Mar 16, 2009, at 4:35 PM, Greg Smith wrote:

On Mon, 16 Mar 2009, Joe Uhl wrote:

Here is vmstat 1 30. We are under peak load right now so I can gather information from the real deal

Quite helpful, reformatting a bit and picking an informative section:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd     free  buff    cache  si  so    bi    bo    in    cs us sy id wa
 0 34 95048 11025880 56988 15020168   0   0  3852   160  3616  8614 11  1  6 82
 3 25 95048 10996356 57044 15044796   0   0  7892   456  3126  7115  4  3  8 85
 1 26 95048 10991692 57052 15050100   0   0  5188   176  2566  5976  3  2 12 83

This says that your server is spending all its time waiting for I/O; actual CPU utilization is minimal. You're only achieving around 3-8MB/s of random I/O. That's the reality of what your disk I/O subsystem is capable of, regardless of what its sequential performance with dd looks like. If you were to run a more complicated benchmark like bonnie++ instead, I'd bet that your "seeks/second" results are very low, even though sequential read/write is fine.
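
A minimal bonnie++ run for that check might look like the following (a sketch only; the directory and user are placeholders, -s should be roughly twice your RAM in MB so caching doesn't skew the numbers, and -n 0 skips the small-file tests). The "Random Seeks" column of the output is the interesting one here:

bonnie++ -d /var/lib/pgsql/bench -s 49152 -n 0 -u postgres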

The Perc5 controllers have a pretty bad reputation for performance on this list, even in RAID10. Not much you can do about that beyond scrapping the controller and getting a better one.

What you might do in order to reduce the total number of writes needed is some standard postgresql.conf tuning; see http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

What you could do here is increase shared_buffers, checkpoint_segments, and checkpoint_completion_target as described there. Having more buffers dedicated to the database and having fewer checkpoints can result in less random I/O showing up, as popular data pages will stay in RAM longer without getting written out so much.
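
As a starting-point sketch only (the absolute values are illustrative and depend on your RAM and write volume, and checkpoint_completion_target requires 8.3 or later):

shared_buffers = 6GB                  # a common starting point is ~25% of RAM
checkpoint_segments = 64              # fewer, larger checkpoints
checkpoint_completion_target = 0.9    # spread checkpoint writes over more time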

--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com Baltimore, MD



