On 6 September 2011, 10:55, Gerhard Wohlgenannt wrote:
> >> That's why I love dstat, just do this
> >>
> >> $ dstat -C 0,1,2,3,4,5,6,7
> >>
> >> and you know all you need.
>
> dstat looks like a very nice tool, results below ..
> (now the system load seems a bit lower than before, when generating the
> results for vmstat and iostat)

>>> Good catch, thanks Scott.
>>
>> Yes, good catch.
>>
>> Still, this does not explain why the queries were running fast before,
>> and why the RAID array is so sluggish. Not to mention that we don't
>> know what were the conditions when collecting those numbers (were the
>> VMs off or running?).
>
> the VMs were running. they are in something like production use, so i
> shouldn't just turn them off .. :-)
> and the processes in the VMs cause a big portion of the DB load, so
> turning them off would distort the results ...

Distort the results? If you want to measure the RAID performance, you
have to do that when there are no other processes using it.

> and thanks again for all the replies!!! :-)

Please, use something like pastebin.com to post these results. It was
bearable for the vmstat output, but this is almost unreadable due to the
wrapping.

> ~# dstat -C 0,1,2,3,4,5,6,7
>        usr sys idl wai hiq siq
> cpu0:    7   1  75  17   0   0
> cpu1:    4   5  84   7   0   0
> cpu2:    5   3  80  12   0   0
> cpu3:    4   3  85   9   0   0
> cpu4:    7   2  75  16   0   0
> cpu5:    4   2  87   8   0   0
> cpu6:    7   2  75  16   0   0
> cpu7:    4   1  87   8   0   0
> dsk/total: read 5071k, writ 2578k | net/total: recv 0, send 0
> paging: in 9760B, out 9431B | system: int 2468, csw 4126
> ...

But if I read that correctly, the wait for the cores is 17%, 7%, 12%, 9%,
16%, 8%, 16% and 8%, and the cores are mostly idle (idl is around 80%).
So it seems there's a small number of processes being switched between
the CPUs, and most of the time they're waiting for the I/O.

Given the low values for disk I/O and the iostat output we've seen
before, it's obvious there's a lot of random I/O (mostly writes).

Let's speculate for a while about what could cause this (in arbitrary
order):

1) Checkpoints. Something is doing a lot of writes, and with a DB that
often means a checkpoint is in progress. I'm not sure about your
checkpoint_timeout, but you do have 192 segments and about 7GB of shared
buffers, so there may be a lot of dirty buffers (even 100% of the
buffers). You're using RAID5, and that really is not a write-friendly
RAID level. We don't know the actual performance of the array, as the
bonnie run was done with the VMs accessing the volume, but RAID10
usually performs much better for writes. Enable log_checkpoints in the
config and see what's going on. You can also use iotop to see which
processes are doing the writes (it might be the background writer, ...).

2) The RAID is broken and can't handle the load it handled fine before.
This is not very likely, as you've mentioned that there were no warnings
etc.

3) There are some new jobs that do a lot of I/O. Is there anything new
that wasn't running before? I guess you'd have mentioned that.

4) The database grew significantly in a short period of time, and the
active part no longer fits into RAM (the page cache), so the data has to
be retrieved from the disk. A quick way to check the growth is sketched
below.
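This is just a sketch (I'm assuming the database is called 'mydb' here -
replace that with the actual name), but something like this should show
whether the database really grew, and which tables grew the most:

  # total size of the database
  psql -d mydb -c "SELECT pg_size_pretty(pg_database_size('mydb'));"

  # the ten largest tables (sizes include indexes and TOAST)
  psql -d mydb -c "SELECT relname,
                          pg_size_pretty(pg_total_relation_size(oid)) AS size
                     FROM pg_class
                    WHERE relkind = 'r'
                    ORDER BY pg_total_relation_size(oid) DESC
                    LIMIT 10;"

If you have numbers from a few weeks ago (or an old backup) to compare
with, that would tell you how fast it's growing.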
And it's not just about the database size, it's about the active part of
the database - if you're suddenly accessing more data, the cache may not
be large enough. This is usually a gradual process (the cache hit ratio
slowly decreases as the database grows), but if the database grew
rapidly ...

This could also be caused by MVCC, i.e. there may be a lot of dead
tuples - have you done a big UPDATE / DELETE or something like that
recently? (a quick query to check is in the PS below)

regards
Tomas
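PS: To check for the dead tuples, something like this should give a
rough idea (again just a sketch, assuming the database is called 'mydb'):

  psql -d mydb -c "SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
                     FROM pg_stat_user_tables
                    ORDER BY n_dead_tup DESC
                    LIMIT 10;"

Tables with a lot of dead tuples and no recent (auto)vacuum would point
in that direction.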