Re: FW: Queries becoming slow under heavy load

Scott,
Thanks for your response.
Our storage is over NFS ...
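
(Since the storage is NFS, the client-side NFS counters and mount options are probably relevant too; if it helps, we can post the output of something like:

  nfsstat -c
  grep nfs /proc/mounts

but we haven't dug into those yet.)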

Here is what we see during our performance testing:
This is about 7 seconds after the query was sent to postgres:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7090 root      25   0  689m 399m  10m R 89.9  5.0   3868:44 java               
 1846 postgres  16   0  474m 198m 103m R 75.2  2.5   0:28.69 postmaster         
 2170 postgres  15   0  391m 203m 188m R 44.0  2.6   0:17.63 postmaster         
 2555 httpd     18   0  298m  15m 4808 R 22.0  0.2   0:00.12 httpd              
 2558 root      15   0 29056 2324 1424 R  1.8  0.0   0:00.01 top                
 1207 httpd     15   0  337m  20m 7064 R  0.0  0.3   0:00.69 httpd              
28312 postgres  16   0  396m 183m 162m D  0.0  2.3   0:50.82 postmaster  <---- this is the query here

Notice the 0% CPU; also note the 183m RES memory.
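
One way to confirm the backend is genuinely blocked on I/O rather than just idle (using PID 28312 from the snapshot above) would be to check its kernel wait channel:

  ps -o pid,stat,wchan:30,comm -p 28312

A STAT of D together with a wchan pointing at page or NFS I/O routines would line up with a storage wait.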

Ten seconds later:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7090 root      25   0  689m 399m  10m R 92.9  5.0   3868:53 java               
 2657 root      15   0 29056 2328 1424 R  1.9  0.0   0:00.01 top                
28312 postgres  16   0  396m 184m 162m D  0.0  2.3   0:50.84 postmaster  <---- here

Ten seconds after that:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7090 root      25   0  689m 399m  10m R 88.7  5.0   3869:02 java               
 1845 postgres  16   0  473m 223m 127m D 22.6  2.8   0:26.39 postmaster         
 2412 httpd     15   0 2245m 1.4g  16m R 18.9 17.8   0:02.48 java               
  966 postgres  15   0  395m 242m 221m D  0.0  3.0   1:02.31 postmaster         
 2680 root      15   0 29056 2336 1424 R  0.0  0.0   0:00.01 top                
28312 postgres  16   0  396m 184m 163m D  0.0  2.3   0:50.85 postmaster  <--- here

etc....

and it's not until around the 221-second mark that we catch it consuming CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7090 root      25   0  689m 399m  10m R 93.4  5.0   3872:07 java               
28312 postgres  16   0  396m 225m 204m R  5.7  2.8   0:51.52 postmaster   <----- here         
 3391 root      15   0 29056 2348 1424 R  1.9  0.0   0:00.01 top                
 4297 root      16   0 10228  740  632 D  0.0  0.0  12:53.66 hald-addon-stor    
26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0   0:00.01 java

Note that the load average is fine during this timeframe, ~4 on an 8-core box, so there is plenty of CPU headroom.

Looks like this is true "halting".
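
Rather than eyeballing successive top screens, a batch-mode top pinned to that one PID (again 28312 from the snapshots above) should show the halt as a single stream:

  top -b -d 10 -p 28312

Every sample sitting in state D with TIME+ barely advancing tells the same story in one log.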

Worse yet, this same behavior expands out to multiple processes, producing a true "back up". It can look something like this; notice the 0% CPU consumption:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7090 root      22   0  689m 399m  10m R 91.1  5.0   3874:32 java               
 4139 root      15   0 29080 2344 1424 R  1.9  0.0   0:00.01 top                
 1555 postgres  16   0  474m 258m 162m D  0.0  3.2   0:17.32 postmaster         
 1846 postgres  16   0  474m 285m 189m D  0.0  3.6   0:47.43 postmaster         
 2713 postgres  16   0  404m 202m 179m D  0.0  2.5   0:33.54 postmaster         
 2801 postgres  16   0  391m 146m 131m D  0.0  1.8   0:04.48 postmaster         
 2804 postgres  16   0  419m 172m 133m D  0.0  2.2   0:09.41 postmaster         
 2825 postgres  16   0  473m 142m  49m D  0.0  1.8   0:04.12 postmaster
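
A quick way to count how many backends are stuck like this at any instant (a sketch; adjust the user if your cluster doesn't run as postgres) would be:

  ps -u postgres -o pid,stat,cmd --no-headers | awk '$2 ~ /^D/ {print; n++} END {print n " backend(s) in D state"}'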

Thanks for any additional explanation/advice,
Anne

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@xxxxxxxxx] 
Sent: Wednesday, January 26, 2011 8:19 PM
To: Anne Rosset
Cc: pgsql-performance@xxxxxxxxxxxxxx
Subject: Re: FW:  Queries becoming slow under heavy load

On Wed, Jan 26, 2011 at 9:04 AM, Anne Rosset <arosset@xxxxxxxxxx> wrote:

<HUGE LIST OF SETTINGS DELETED>

PLEASE post just the settings you changed.  I'm not searching through a list that big for the interesting bits.

> Today we did more analysis and observed  postgress processes that 
> continually reported status 'D' in top.

Full stop.  The most likely problem here is that the query is now hitting the disks and waiting.  If you have one disk and two users, the access speed drops by a large factor, usually much more than 2x.

To put it very simply, you need as many mirror pairs in your RAID-10, or as many disks in your RAID-5 or RAID-6, as you have users reading from the drives concurrently.  If you're also writing, you need still more disks.
Mitigating this issue we find things like SSD caches in ZFS or battery-backed RAID controllers.  They allow reads and writes to be streamlined quite a bit on their way to the spinning disks, making it appear that the RAID array underneath is much faster, has better access times, and that all the sectors are near each other.  To an extent.
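
Rough arithmetic, using ballpark numbers rather than anything measured: a single 7200 RPM drive handles on the order of 100 random IOPS.  Eight backends doing random reads against one spindle therefore get roughly 100/8 = ~12 IOPS apiece, i.e. ~80 ms of seeking per page fetch; spread the same load over eight spindles (four RAID-10 mirror pairs, with reads served from either side) and each backend gets its ~100 IOPS back.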

If you have the answer to the previous poster's question, "can you tell us what sort of IO you have (sata, scsi, raid, # of disks, etc)?", you should provide it.  If you've got a pair of 5k RPM SATA drives in a RAID-1, you might need more hardware.

So, instead of just settings, show us a few carefully selected lines of output from vmstat or iostat while this is happening.  Don't tell us what you see, show us.
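
For example (a sketch; run these while the queries are stalled):

  vmstat 5 5
  iostat -x 5 3

Sustained high values in vmstat's "wa" column, or %util near 100 on the database volume in the iostat output, would say far more than a description of top.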



