Scott,

Thanks for your response. We are on NFS for our storage. Here is what we see during our performance testing.

This is about 7 seconds after the query was sent to postgres:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 89.9  5.0  3868:44 java
 1846 postgres  16   0  474m 198m 103m R 75.2  2.5  0:28.69 postmaster
 2170 postgres  15   0  391m 203m 188m R 44.0  2.6  0:17.63 postmaster
 2555 httpd     18   0  298m  15m 4808 R 22.0  0.2  0:00.12 httpd
 2558 root      15   0 29056 2324 1424 R  1.8  0.0  0:00.01 top
 1207 httpd     15   0  337m  20m 7064 R  0.0  0.3  0:00.69 httpd
28312 postgres  16   0  396m 183m 162m D  0.0  2.3  0:50.82 postmaster  <---- this is the query here

Notice the 0% CPU; also notice the 183m RES memory.

Ten seconds later:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 92.9  5.0  3868:53 java
 2657 root      15   0 29056 2328 1424 R  1.9  0.0  0:00.01 top
28312 postgres  16   0  396m 184m 162m D  0.0  2.3  0:50.84 postmaster  <---- here

Ten seconds after that:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 88.7  5.0  3869:02 java
 1845 postgres  16   0  473m 223m 127m D 22.6  2.8  0:26.39 postmaster
 2412 httpd     15   0 2245m 1.4g  16m R 18.9 17.8  0:02.48 java
  966 postgres  15   0  395m 242m 221m D  0.0  3.0  1:02.31 postmaster
 2680 root      15   0 29056 2336 1424 R  0.0  0.0  0:00.01 top
28312 postgres  16   0  396m 184m 163m D  0.0  2.3  0:50.85 postmaster  <--- here

And so on. It is not until around the 221-second mark that we catch it consuming CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      25   0  689m 399m  10m R 93.4  5.0  3872:07 java
28312 postgres  16   0  396m 225m 204m R  5.7  2.8  0:51.52 postmaster  <----- here
 3391 root      15   0 29056 2348 1424 R  1.9  0.0  0:00.01 top
 4297 root      16   0 10228  740  632 D  0.0  0.0 12:53.66 hald-addon-stor
26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0  0:00.01 java

Note that the load average is fine during this timeframe (~4 out of 8), so there is plenty of CPU. This looks like true "halting".
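Incidentally, a quick generic way to confirm what state-'D' (uninterruptible sleep) backends like that are blocked on is to look at their kernel wait channel. This is a sketch for a typical Linux box, not output from our system; for an NFS-backed database you would expect NFS/RPC or filesystem wait functions to show up:

```shell
# Show every process stuck in uninterruptible sleep (state D) together
# with the kernel function it is sleeping in (WCHAN). The awk filter
# keeps the header line plus any row whose STAT column starts with D.
ps -eo pid,user,stat,wchan:30,cmd | awk 'NR == 1 || $3 ~ /^D/'
```

Per-process, `cat /proc/<pid>/wchan` gives the same information on most kernels, and `/proc/<pid>/stack` (as root, where available) shows the full kernel stack of the sleeping process.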
Further, and worse yet, this same behavior expands out to multiple processes, producing a true "back up". It can look something like this; notice the 0% CPU consumption:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7090 root      22   0  689m 399m  10m R 91.1  5.0  3874:32 java
 4139 root      15   0 29080 2344 1424 R  1.9  0.0  0:00.01 top
 1555 postgres  16   0  474m 258m 162m D  0.0  3.2  0:17.32 postmaster
 1846 postgres  16   0  474m 285m 189m D  0.0  3.6  0:47.43 postmaster
 2713 postgres  16   0  404m 202m 179m D  0.0  2.5  0:33.54 postmaster
 2801 postgres  16   0  391m 146m 131m D  0.0  1.8  0:04.48 postmaster
 2804 postgres  16   0  419m 172m 133m D  0.0  2.2  0:09.41 postmaster
 2825 postgres  16   0  473m 142m  49m D  0.0  1.8  0:04.12 postmaster

Thanks for any additional explanation/advice,
Anne

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@xxxxxxxxx]
Sent: Wednesday, January 26, 2011 8:19 PM
To: Anne Rosset
Cc: pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: FW: Queries becoming slow under heavy load

On Wed, Jan 26, 2011 at 9:04 AM, Anne Rosset <arosset@xxxxxxxxxx> wrote:

<HUGE LIST OF SETTINGS DELETED>

PLEASE post just the settings you changed. I'm not searching through a list that big for the interesting bits.

> Today we did more analysis and observed postgres processes that
> continually reported status 'D' in top.

Full stop. The most likely problem here is that the query is now hitting the disks and waiting. If you have one disk and two users, the access speed will drop by factors, usually much higher than 2. To put it very simply, you need as many mirror pairs in your RAID-10, or as many disks in your RAID-5 or RAID-6, as you have users reading the disk drives. If you're writing, you need even more disks. Mitigating this issue, we find things like SSD cache in ZFS or battery-backed RAID controllers.
They allow reads and writes to be streamlined quite a bit on their way to the spinning disks, making it appear that the RAID array underneath was much faster, had better access times, and that all the sectors were near each other. To an extent.

If you have the answer to the previous poster's question, "can you tell us what sort of IO you have (sata, scsi, raid, # of disks, etc.)", you should provide it. If you've got a pair of 5k RPM SATA drives in a RAID-1, you might need more hardware.

So, instead of just settings, show us a few carefully selected lines of output from vmstat or iostat while this is happening. Don't tell us what you see; show us.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
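As a concrete example of the kind of evidence Scott is asking for, a capture along these lines, run while a query is stuck in state 'D', would show whether the box is I/O-bound. The intervals and flags here are illustrative, not taken from the thread:

```shell
# Sample system-wide I/O waits a few times while the slow query runs.
# In vmstat, the 'b' column counts processes blocked in uninterruptible
# sleep (the same D state seen in top) and 'wa' is the percentage of
# CPU time spent waiting for I/O.
vmstat 2 3

# Per-device latency and saturation, if the sysstat package is installed:
# 'await' is the average request service time, '%util' is device saturation.
command -v iostat >/dev/null && iostat -x 2 2 || echo "iostat not installed"
```

A steadily nonzero 'b' column with high 'wa', or a device pinned near 100% in '%util', would confirm that the D-state backends are waiting on storage rather than CPU.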