Re: FW: Queries becoming slow under heavy load

On 01/27/2011 11:12 PM, Anne Rosset wrote:

> Thanks for your response.
> We are over NFS for our storage ...

NFS? I'm not sure you know this, but NFS has major locking issues
that would make it a terrible candidate for hosting a database.

> and it's not until around the 221 second mark that we catch it consuming CPU:
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   7090 root      25   0  689m 399m  10m R 93.4  5.0   3872:07 java
> 28312 postgres  16   0  396m 225m 204m R  5.7  2.8   0:51.52 postmaster<----- here
>   3391 root      15   0 29056 2348 1424 R  1.9  0.0   0:00.01 top
>   4297 root      16   0 10228  740  632 D  0.0  0.0  12:53.66 hald-addon-stor
> 26885 httpd     15   0 2263m 1.5g  16m R  0.0 19.0   0:00.01 java
> 
> Note that the load average is fine during this timeframe, ~4 out of 8, so plenty of CPU.

Please listen to us. We asked you to run sar or iostat to tell
us how heavily the disk IO is being utilized. From your other
screenshots, there were at least two other PG processes
running that could have been thrashing the disk or locking
tables your "slow" query needed. If a process is waiting for disk
IO, its CPU usage will stay low until it gets the data it needs.
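As a rough sketch of what we're asking for (the device name dm-9, the capture window, and the 90% threshold are all assumptions; substitute your own device):

```shell
# Capture per-device stats while the slow query runs, e.g.:
#   iostat -dmx dm-9 1 60 > /tmp/iostat.log
#   sar -d 1 60 > /tmp/sar.log
# Then scan the capture for saturated intervals; %util is the last column.
# The sample lines below are made up for illustration:
cat > /tmp/iostat.log <<'EOF'
dm-9 0.00 0.00 125.46 227.78 4.95 0.89 33.88 0.08 0.19 0.08 2.91
dm-9 0.00 0.00 11.00 847.00 0.09 3.31 8.10 29.31 49.65 0.79 95.90
EOF
awk '$1 == "dm-9" && $NF+0 > 90 {print "saturated:", $NF "% util"}' /tmp/iostat.log
```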

Not everything is about the CPU. Especially now that we know your DB is 
running on top of NFS.

> Further, or worse yet, this same behavior expands out to multiple processes, 
> producing a true "back up". It can look
> something like this. Notice the 0% cpu consumption:
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   7090 root      22   0  689m 399m  10m R 91.1  5.0   3874:32 java
>   4139 root      15   0 29080 2344 1424 R  1.9  0.0   0:00.01 top
>   1555 postgres  16   0  474m 258m 162m D  0.0  3.2   0:17.32 postmaster
>   1846 postgres  16   0  474m 285m 189m D  0.0  3.6   0:47.43 postmaster
>   2713 postgres  16   0  404m 202m 179m D  0.0  2.5   0:33.54 postmaster
>   2801 postgres  16   0  391m 146m 131m D  0.0  1.8   0:04.48 postmaster
>   2804 postgres  16   0  419m 172m 133m D  0.0  2.2   0:09.41 postmaster
>   2825 postgres  16   0  473m 142m  49m D  0.0  1.8   0:04.12 postmaster

Yes. And look at the state column: every one of those postmaster
processes is in D (uninterruptible sleep), which almost always means
it's stuck waiting on IO. Or NFS locking is blocking the reads.
Or... what is that Java app doing? We don't know the state of your
IO, and when you see 0% or very low CPU usage, you either have
locking contention or you're being IO starved.
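To separate lock contention from IO starvation, something like this against pg_locks could help. This is a sketch: the connection flags in the commented psql call are assumptions for your environment, and the column names (procpid, current_query) are the pre-9.2 ones.

```shell
# Find backends waiting on a lock that has not been granted.
SQL="SELECT l.pid, l.relation::regclass, l.mode, a.current_query
       FROM pg_locks l
       JOIN pg_stat_activity a ON a.procpid = l.pid
      WHERE NOT l.granted;"
echo "$SQL"                               # inspect the query first
# psql -U postgres -d yourdb -c "$SQL"    # then run it on your server
```

If this returns rows while your queries stall, you have lock contention; if it's empty and the backends sit in D state, suspect the storage.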

And what queries are these connections performing? You can check by
querying the pg_stat_activity system view. If they're running plain
SELECTs and still "slow", compare that against the iostat or sar
results. For instance, here's iostat output from our system:

iostat -dmx dm-9 1

Linux 2.6.18-92.el5 (oslchi6pedb1) 	01/28/2011

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-9              0.00     0.00 125.46 227.78     4.95     0.89    33.88     0.08    0.19   0.08   2.91

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-9              0.00     0.00  5.00  0.00     0.04     0.00    14.40     0.05   10.60  10.60   5.30

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-9              0.00     0.00  2.00  0.00     0.02     0.00    16.00     0.01    7.00   7.00   1.40

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-9              0.00     0.00  4.00 1184.00     0.04     4.62     8.04    27.23   11.73   0.06   6.80

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-9              0.00     0.00 11.00 847.00     0.09     3.31     8.10    29.31   49.65   0.79  67.90


That last column, %util, effectively tells us how saturated the
controller is. If the percentage is high, it's working hard to
supply the data we're asking for, or to absorb writes. If it's
low, we're probably working from memory cache, or getting fewer
requests. When our queries are "slow" and we check this stat,
it's often at or above 90%, sometimes for minutes at a time.
That's almost always a clear indicator of IO contention. Queries
can't run without the data they need to return your results.
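The pg_stat_activity check mentioned above can be sketched like this. Again, the connection flags are assumptions, and the column names (procpid, waiting, current_query) match the 8.x/9.0-era releases; 9.2+ renamed them to pid, state, and query.

```shell
# List non-idle backends and what each one is executing.
SQL="SELECT procpid, usename, waiting, current_query
       FROM pg_stat_activity
      WHERE current_query <> '<IDLE>'
      ORDER BY procpid;"
echo "$SQL"                               # inspect before running
# psql -U postgres -d yourdb -c "$SQL"    # run against your database
```

A backend showing waiting = true points at locks; a backend running a query with 0% CPU and no lock wait points back at the disk.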

Sending us more CPU charts isn't going to help us help you.

-- 
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@xxxxxxxxx

______________________________________________

See  http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

