I dropped the pool sizes and brought things back up. Things are
stable, the site is fast, and CPU utilization is still high. It is probably
just a matter of time before the issue comes back (we get slammed as kids
get out of school in the US).
Now when I run vmstat 1 30 it looks very different (below). Waiting
is minimal, and user time is very high. Under nontrivial load, according to
xact_commit in pg_stat_database, we are doing 1800+ tps.
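For reference, that tps number comes from the change in xact_commit over a
sample window (the counter is cumulative). A minimal sketch of the query
(standard pg_stat_database columns; run it twice, some seconds apart, and
divide the delta by the interval):

  -- xact_commit is cumulative, so tps ~ (later sample - earlier sample) / seconds between samples
  SELECT datname, xact_commit, xact_rollback, now() AS sampled_at
    FROM pg_stat_database
   WHERE datname = current_database();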
I appreciate the input and the explanation of vmstat. I am going to throw
some of these numbers into Zabbix so I can keep a better eye on them.
This server is a couple of years old, so the purchase of a new controller
and/or disks is not out of the question.
One final note: I have made several changes to postgresql.conf. Some of
those are below:
max_connections = 1000
shared_buffers = 7680MB
work_mem = 30MB
synchronous_commit = off
checkpoint_segments = 50
effective_cache_size = 20000MB
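Related to the monitoring above: since max_connections is set high and the
earlier lockup coincided with 637 non-idle connections, a query along these
lines could feed Zabbix a non-idle backend count (a sketch assuming an
8.x-era pg_stat_activity, where idle backends report current_query = '<IDLE>';
newer releases expose a state column instead):

  -- backends actually doing work, to graph against max_connections
  SELECT count(*) AS non_idle_backends
    FROM pg_stat_activity
   WHERE current_query <> '<IDLE>';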
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff    cache    si   so    bi    bo    in    cs us sy id wa
 9  8  73036 500164  82200 23497748     3    4   669   541     1     1 23  3 54 19
20  4  73036 497252  82200 23500836     0    0  2500   680 11145 15168 91  4  2  2
21  1  73036 491416  82204 23503832     0    0  1916   920 10303 14032 94  4  1  1
23  5  73036 489580  82212 23505860     0    0  1348  3296 11682 15970 94  5  1  0
31  1  73036 481408  82220 23507752     0    0   984  8988 10123 11289 97  3  0  0
25  4  73036 483248  82232 23509420     0    0  1268  1312 10705 14063 96  4  0  0
23  4  73036 480096  82232 23512380     0    0  2372   472  9805 13996 94  5  1  1
24  4  73036 476732  82236 23515196     0    0  2012   720 10365 14307 96  3  1  0
22  1  73036 474468  82236 23516584     0    0   944  3108  9838 12831 95  4  1  0
14  1  73036 455756  82284 23534548     0    0   908  3284  9096 11333 94  4  1  0
10  2  73036 455224  82292 23536304     0    0  1760   416 12454 17736 89  6  3  2
17  0  73036 460620  82292 23538888     0    0  1292   968 12030 18333 90  7  2  1
13  4  73036 459764  82292 23539724     0    0   332   288  9722 14197 92  5  2  1
17  5  73036 457516  82292 23542176     0    0  1872 17752 10458 15465 91  5  2  1
19  4  73036 450804  82300 23545640     0    0  2980   640 10602 15621 90  6  2  2
24  0  73036 447660  82312 23547644     0    0  1736 10724 12401 15413 93  6  1  0
20  6  73036 444380  82320 23550692     0    0  2064   476  9008 10985 94  4  1  0
22  2  73036 442880  82328 23553640     0    0  2496  3156 10739 15211 93  5  1  1
11  1  73036 441448  82328 23555632     0    0  1452  3552 10812 15337 93  5  2  1
 6  2  73036 439812  82348 23557420     0    0  1052  1128  8603 10514 91  3  3  2
 6  3  73036 433456  82348 23560860     0    0  2484   656  7636 13033 68  4 14 14
 6  3  73036 433084  82348 23562628     0    0  1400   408  6046 11778 70  3 18  9
 5  0  73036 430776  82356 23564264     0    0  1108  1300  7549 13754 73  4 16  7
 5  2  73036 430124  82360 23565580     0    0  1016  2216  7844 14507 72  4 18  6
 4  2  73036 429652  82380 23567480     0    0  1168  2468  7694 15466 58  4 24 14
 6  2  73036 427304  82384 23569668     0    0  1132   752  5993 13606 49  5 36 10
 7  1  73036 423020  82384 23571932     0    0  1244   824  8085 18072 56  3 30 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff    cache    si   so    bi    bo    in    cs us sy id wa
 4  0  73036 420816  82392 23573824     0    0  1292   820  5370 10958 46  2 41 10
 9  1  73020 418048  82392 23576900    52    0  1632  2592  5931 11629 60  3 29  8
 4  2  73004 415164  82424 23578620    56    0  1812  4116  7503 14674 71  3 15 12
On Mar 16, 2009, at 4:19 PM, Dave Youatt wrote:
Last column "wa" is % cpu time spent waiting (for IO to complete).
80s
and 90s is pretty high, probably too high.
You might also want to measure the IO/s performance of your RAID
controller. From the descriptions, it will be much more important than
long sequential reads/writes for characterizing your workload.
There are also some disappointing HW RAID controllers out there.
Generally, Areca and Promise are good, Adaptec is good depending on the
model, and the ones Dell ships with their servers haven't had good
reviews/reports.
On 03/16/2009 01:10 PM, Joe Uhl wrote:
Here is vmstat 1 30. We are under peak load right now so I can gather
information from the real deal :)

Had an almost complete lockup a moment ago; the number of non-idle
postgres connections was 637. Going to drop our JDBC pool sizes a bit
and bounce everything.
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff    cache    si   so    bi    bo    in    cs us sy id wa
12 35  95056 11102380  56856 14954948     3    4   669   541     1     2 23  3 54 19
12 39  95056 11092484  56876 14963204     0    0  6740  1204 10066 13277 91  5  0  4
 8 42  95056 11081712  56888 14972244     0    0  8620  1168 10659 17020 78  6  0 15
10 30  95052 11069768  56904 14982628     0    0  8944   976  9809 15109 81  6  1 12
 4 27  95048 11059576  56916 14991296     0    0  8852   440  7652 13294 63  4  2 32
 5 42  95048 11052524  56932 14996496     0    0  4700   384  6383 11249 64  4  4 28
 5 33  95048 11047492  56956 15001428     0    0  3852   572  6029 14010 36  4  5 56
 7 35  95048 11041184  56960 15005480     0    0  3964   136  5042 10802 40  3  1 56
 1 33  95048 11037988  56968 15009240     0    0  3892   168  3384  6479 26  1  3 69
 3 28  95048 11029332  56980 15015744     0    0  6724   152  4964 12844 11  2  8 79
 0 34  95048 11025880  56988 15020168     0    0  3852   160  3616  8614 11  1  6 82
 3 25  95048 10996356  57044 15044796     0    0  7892   456  3126  7115  4  3  8 85
 1 26  95048 10991692  57052 15050100     0    0  5188   176  2566  5976  3  2 12 83
 0 29  95048 10985408  57060 15054968     0    0  4200    80  2586  6582  4  1 12 83
 1 29  95048 10980828  57064 15058992     0    0  4560    64  2966  7557  7  2  6 85
 2 28  95048 10977192  57072 15063176     0    0  3860    72  2695  6742 11  1  7 81
 2 29  95048 10969120  57088 15067808     0    0  5084    84  3296  8067 14  1  0 84
 0 25  95048 10962096  57104 15072984     0    0  4440   500  2721  6263 12  1  6 80
 0 23  95044 10955320  57108 15079260     0    0  5712   232  2678  5990  6  1  6 87
 2 25  95044 10948644  57120 15084524     0    0  5120   184  3499  8143 20  3  9 69
 3 21  95044 10939744  57128 15090644     0    0  5756   264  4724 10272 32  3  5 60
 1 19  95040 10933196  57144 15095024    12    0  4440   180  2585  5244 13  2 15 70
 0 21  95040 10927596  57148 15098684     0    0  3248   136  2973  7292  8  1  9 81
 1 20  95040 10920708  57164 15104244     0    0  5192   360  1865  4547  3  1  9 87
 1 24  95040 10914552  57172 15105856     0    0  2308    16  1948  4450  6  1  1 93
 0 24  95036 10909148  57176 15110240     0    0  3824   152  1330  2632  3  1  6 90
 1 21  95036 10900628  57192 15116332     0    0  5680   180  1898  3986  4  1 11 84
 0 19  95036 10888356  57200 15121736     0    0  5952   120  2252  3991 12  1  8 79
 2 22  95036 10874336  57204 15128252     0    0  6320   112  2831  6755  5  2  8 85
 3 26  95036 10857592  57220 15134020     0    0  5124   216  3067  5296 32  6  3 59
Alan, my apologies if you get this twice. I didn't reply back to the
list on the first try.
On Mar 16, 2009, at 3:52 PM, Alan Hodgson wrote:
On Monday 16 March 2009, Joe Uhl <joeuhl@xxxxxxxxx> wrote:
Right now (not under peak load) this server is running at 68% CPU
utilization and its SATA RAID 10 is doing about 2 MB/s writes and
11 MB/s reads. When I run dd I can hit 200+ MB/s writes and 230+ MB/s
reads, so we are barely using the available IO. Furthermore, when I run
dd the CPU utilization of that process only approaches 20%-30% of one
core.
What does vmstat say when it's slow? The output of "vmstat 1 30" would
be informative.

Note: dd is sequential I/O. Normal database usage is random I/O.
--
Even a sixth-grader can figure out that you can’t borrow money to pay
off your debt
On Mar 16, 2009, at 4:35 PM, Greg Smith wrote:
On Mon, 16 Mar 2009, Joe Uhl wrote:
Here is vmstat 1 30. We are under peak load right now so I can gather
information from the real deal
Quite helpful, reformatting a bit and picking an informative section:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd     free   buff    cache    si   so    bi    bo    in    cs us sy id wa
 0 34  95048 11025880  56988 15020168     0    0  3852   160  3616  8614 11  1  6 82
 3 25  95048 10996356  57044 15044796     0    0  7892   456  3126  7115  4  3  8 85
 1 26  95048 10991692  57052 15050100     0    0  5188   176  2566  5976  3  2 12 83
This says that your server is spending all its time waiting for I/O;
actual CPU utilization is minimal. You're only achieving around
3-8 MB/s of random I/O. That's the reality of what your disk I/O
subsystem is capable of, regardless of what its sequential performance
with dd looks like. If you were to run a more complicated benchmark
like bonnie++ instead, I'd bet that your "seeks/second" results would
be very low, even though sequential read/write is fine.

The Perc5 controllers have a pretty bad reputation for performance on
this list, even in RAID 10. Not much you can do about that beyond
scrapping the controller and getting a better one.
What you might do in order to reduce the total number of writes needed
is some standard postgresql.conf tuning; see http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
What you could do here is increase shared_buffers,
checkpoint_segments, and checkpoint_completion_target as described
there. Having more buffers dedicated to the database and having fewer
checkpoints can result in less random I/O showing up, since popular
data pages will stay in RAM longer without getting written out so
much.
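A quick way to see where those three settings currently stand before
editing postgresql.conf is the pg_settings view; a minimal sketch
(checkpoint_segments applies to 8.x-era servers such as this one):

  -- current values of the settings mentioned above
  SELECT name, setting, unit
    FROM pg_settings
   WHERE name IN ('shared_buffers', 'checkpoint_segments', 'checkpoint_completion_target');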
--
* Greg Smith gsmith@xxxxxxxxxxxxx http://www.gregsmith.com
Baltimore, MD
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance