Excellent! And you learned a bit more about how to monitor your server
while you were at it. Win win!

On Thu, Feb 3, 2011 at 10:40 AM, Anne Rosset <arosset@xxxxxxxxxx> wrote:
> Thanks to all of you who replied and pointed to NFS as a potential
> culprit.
> Our issue was that pgsql's temp dir (pgsql_tmp) was set to the default
> value ($PSQL_DIR/base/pgsql_tmp/), which was located on NFS.
> Moving the temp dir to local disk got us a huge improvement.
>
> Anne
>
> -----Original Message-----
> From: Shaun Thomas [mailto:sthomas@xxxxxxxxx]
> Sent: Friday, January 28, 2011 7:31 AM
> To: Anne Rosset
> Cc: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re: FW: Queries becoming slow under heavy load
>
> On 01/27/2011 11:12 PM, Anne Rosset wrote:
>
>> Thanks for your response.
>> We are over NFS for our storage ...
>
> NFS? I'm not sure you know this, but NFS has major locking issues that
> make it a terrible candidate for hosting a database.
>
>> and it's not until around the 221-second mark that we catch it
>> consuming CPU:
>>
>>   PID USER      PR  NI  VIRT   RES   SHR  S %CPU %MEM    TIME+  COMMAND
>>  7090 root      25   0  689m  399m   10m  R 93.4  5.0  3872:07  java
>> 28312 postgres  16   0  396m  225m  204m  R  5.7  2.8  0:51.52  postmaster  <----- here
>>  3391 root      15   0 29056  2348  1424  R  1.9  0.0  0:00.01  top
>>  4297 root      16   0 10228   740   632  D  0.0  0.0 12:53.66  hald-addon-stor
>> 26885 httpd     15   0 2263m  1.5g   16m  R  0.0 19.0  0:00.01  java
>>
>> Note that the load average is fine during this timeframe, ~4 out of 8,
>> so plenty of CPU.
>
> Please listen to us. We asked you to use sar or iostat to tell us how
> much the disk IO is being utilized. From your other screenshots, there
> were at least two other PG processes running that could have been
> thrashing the disk or locking tables your "slow" query needed. If it's
> waiting for disk IO, the CPU will remain low until it gets what it
> needs.
>
> Not everything is about the CPU. Especially now that we know your DB is
> running on top of NFS.
>
>> Further, or worse yet, this same behavior expands out to multiple
>> processes, producing a true "back up". It can look something like
>> this. Notice the 0% CPU consumption:
>>
>>   PID USER      PR  NI  VIRT   RES   SHR  S %CPU %MEM    TIME+  COMMAND
>>  7090 root      22   0  689m  399m   10m  R 91.1  5.0  3874:32  java
>>  4139 root      15   0 29080  2344  1424  R  1.9  0.0  0:00.01  top
>>  1555 postgres  16   0  474m  258m  162m  D  0.0  3.2  0:17.32  postmaster
>>  1846 postgres  16   0  474m  285m  189m  D  0.0  3.6  0:47.43  postmaster
>>  2713 postgres  16   0  404m  202m  179m  D  0.0  2.5  0:33.54  postmaster
>>  2801 postgres  16   0  391m  146m  131m  D  0.0  1.8  0:04.48  postmaster
>>  2804 postgres  16   0  419m  172m  133m  D  0.0  2.2  0:09.41  postmaster
>>  2825 postgres  16   0  473m  142m   49m  D  0.0  1.8  0:04.12  postmaster
>
> Yes. And they could all be waiting for IO. Or NFS locking is blocking
> the reads. Or... what is that Java app doing? We don't know the state
> of your IO, and when you have 0% or very low CPU usage, you either have
> locking contention or you're being IO-starved.
>
> And what queries are these connections performing? You can check by
> looking at the contents of the pg_stat_activity system view. If they're
> selecting and still "slow", compare that against the iostat or sar
> results.
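
A minimal sketch of that pg_stat_activity check, assuming a pre-9.2
server (on 9.2 and later the procpid and current_query columns become
pid and query) and using a placeholder database name:

    # List each backend: who is running what, whether it is blocked on
    # a lock (waiting = t), and how long its current statement has run.
    psql -d yourdb -c "
      SELECT procpid, usename, waiting,
             now() - query_start AS runtime,
             current_query
        FROM pg_stat_activity
       ORDER BY query_start;"

Backends that top shows in state D while this query reports a long
runtime are the ones to compare against the disk numbers below.
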
> For instance, here's iostat output from our system:
>
> iostat -dmx dm-9 1
>
> Linux 2.6.18-92.el5 (oslchi6pedb1)      01/28/2011
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00  125.46   227.78   4.95   0.89     33.88      0.08   0.19   0.08   2.91
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    5.00     0.00   0.04   0.00     14.40      0.05  10.60  10.60   5.30
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    2.00     0.00   0.02   0.00     16.00      0.01   7.00   7.00   1.40
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    4.00  1184.00   0.04   4.62      8.04     27.23  11.73   0.06   6.80
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00   11.00   847.00   0.09   3.31      8.10     29.31  49.65   0.79  67.90
>
> That last column, %util, effectively tells us how saturated the
> controller is. If the percentage is high, it's really working hard to
> supply the data we're asking for, or trying to write. If it's low,
> we're probably working from memory cache, or getting fewer requests.
> There have been times when our queries were "slow", and when we checked
> this stat it was at or above 90%, sometimes for minutes at a time.
> That's almost always a clear indicator that you have IO contention.
> Queries can't work without the data they need to return your results.
>
> Sending us more CPU charts isn't going to help us help you.
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
> 312-676-8870
> sthomas@xxxxxxxxx

--
To understand recursion, one must first understand recursion.
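
A minimal sketch of the ongoing disk monitoring suggested in the quoted
message, assuming the sysstat package is installed; dm-9 is taken from
the output above and should be replaced with whatever device backs your
data directory:

    # Extended per-device statistics every 5 seconds, 12 samples; the
    # last column, %util, is the saturation figure discussed above.
    iostat -dmx dm-9 5 12

    # sar reports the same per-device %util column; -d selects block
    # devices and -p prints real device names instead of devM-N numbers.
    sar -d -p 5 12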