Excellent! And you learned a bit more about how to monitor your server
while you were at it. Win win!

On Thu, Feb 3, 2011 at 10:40 AM, Anne Rosset <arosset@xxxxxxxxxx> wrote:
> Thanks to all of you who replied and pointed to NFS as a potential
> culprit.
> Our issue was that pgsql's temp dir (pgsql_tmp) was set to the default
> value ($PSQL_DIR/base/pgsql_tmp/), which was located on NFS.
> Moving the temp dir to local disk got us a huge improvement.
>
> Anne
>
> -----Original Message-----
> From: Shaun Thomas [mailto:sthomas@xxxxxxxxx]
> Sent: Friday, January 28, 2011 7:31 AM
> To: Anne Rosset
> Cc: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re: FW: Queries becoming slow under heavy load
>
> On 01/27/2011 11:12 PM, Anne Rosset wrote:
>
>> Thanks for your response.
>> We are over NFS for our storage ...
>
> NFS? I'm not sure you know this, but NFS has major locking issues that
> make it a terrible candidate for hosting a database.
>
>> and it's not until around the 221-second mark that we catch it
>> consuming CPU:
>>
>>   PID USER      PR  NI  VIRT   RES   SHR  S %CPU %MEM    TIME+  COMMAND
>>  7090 root      25   0  689m  399m   10m  R 93.4  5.0  3872:07  java
>> 28312 postgres  16   0  396m  225m  204m  R  5.7  2.8  0:51.52  postmaster  <----- here
>>  3391 root      15   0 29056  2348  1424  R  1.9  0.0  0:00.01  top
>>  4297 root      16   0 10228   740   632  D  0.0  0.0 12:53.66  hald-addon-stor
>> 26885 httpd     15   0 2263m  1.5g   16m  R  0.0 19.0  0:00.01  java
>>
>> Note that the load average is fine during this timeframe, ~4 out of 8,
>> so plenty of CPU.
>
> Please listen to us. We asked you to use sar or iostat to tell us how
> much the disk IO is being utilized. From your other screenshots, there
> were at least two other PG processes running that could have been
> thrashing the disk or locking tables your "slow" query needed. If it's
> waiting for disk IO, the CPU will remain low until it gets what it
> needs.
>
> Not everything is about the CPU. Especially now that we know your DB is
> running on top of NFS.
>
>> Further, or worse yet, this same behavior expands out to multiple
>> processes, producing a true "back up". It can look something like
>> this. Notice the 0% CPU consumption:
>>
>>   PID USER      PR  NI  VIRT   RES   SHR  S %CPU %MEM    TIME+  COMMAND
>>  7090 root      22   0  689m  399m   10m  R 91.1  5.0  3874:32  java
>>  4139 root      15   0 29080  2344  1424  R  1.9  0.0  0:00.01  top
>>  1555 postgres  16   0  474m  258m  162m  D  0.0  3.2  0:17.32  postmaster
>>  1846 postgres  16   0  474m  285m  189m  D  0.0  3.6  0:47.43  postmaster
>>  2713 postgres  16   0  404m  202m  179m  D  0.0  2.5  0:33.54  postmaster
>>  2801 postgres  16   0  391m  146m  131m  D  0.0  1.8  0:04.48  postmaster
>>  2804 postgres  16   0  419m  172m  133m  D  0.0  2.2  0:09.41  postmaster
>>  2825 postgres  16   0  473m  142m   49m  D  0.0  1.8  0:04.12  postmaster
>
> Yes. And they could all be waiting for IO. Or NFS locking is blocking
> the reads. Or... what is that Java app doing? We don't know the state
> of your IO, and when you have 0% or very low CPU usage, you either have
> locking contention or you're being IO-starved.
>
> And what queries are these connections performing? You can check by
> looking at the contents of the pg_stat_activity system view. If they're
> selecting and still "slow", compare that against the iostat or sar
> results.
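
A minimal sketch of that pg_stat_activity check, assuming a pre-9.2
server (on 9.2 and later the procpid and current_query columns become
pid and query) and using a placeholder database name:

    # List each backend: who is running what, whether it is blocked on
    # a lock (waiting = t), and how long its current statement has run.
    psql -d yourdb -c "
      SELECT procpid, usename, waiting,
             now() - query_start AS runtime,
             current_query
        FROM pg_stat_activity
       ORDER BY query_start;"

Backends that top shows in state D while this query reports a long
runtime are the ones to compare against the disk numbers below.
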
> For instance, here's iostat output from our system:
>
> iostat -dmx dm-9 1
>
> Linux 2.6.18-92.el5 (oslchi6pedb1)      01/28/2011
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00  125.46   227.78   4.95   0.89     33.88      0.08   0.19   0.08   2.91
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    5.00     0.00   0.04   0.00     14.40      0.05  10.60  10.60   5.30
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    2.00     0.00   0.02   0.00     16.00      0.01   7.00   7.00   1.40
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00    4.00  1184.00   0.04   4.62      8.04     27.23  11.73   0.06   6.80
>
> Device:  rrqm/s  wrqm/s     r/s      w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  svctm  %util
> dm-9       0.00    0.00   11.00   847.00   0.09   3.31      8.10     29.31  49.65   0.79  67.90
>
> That last column, %util, effectively tells us how saturated the
> controller is. If the percentage is high, it's really working hard to
> supply the data we're asking for, or trying to write. If it's low,
> we're probably working from memory cache, or getting fewer requests.
> There have been times when our queries were "slow", and when we checked
> this stat it was at or above 90%, sometimes for minutes at a time.
> That's almost always a clear indicator that you have IO contention.
> Queries can't work without the data they need to return your results.
>
> Sending us more CPU charts isn't going to help us help you.
>
> --
> Shaun Thomas
> OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
> 312-676-8870
> sthomas@xxxxxxxxx

--
To understand recursion, one must first understand recursion.
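
A minimal sketch of the ongoing disk monitoring suggested in the quoted
message, assuming the sysstat package is installed; dm-9 is taken from
the output above and should be replaced with whatever device backs your
data directory:

    # Extended per-device statistics every 5 seconds, 12 samples; the
    # last column, %util, is the saturation figure discussed above.
    iostat -dmx dm-9 5 12

    # sar reports the same per-device %util column; -d selects block
    # devices and -p prints real device names instead of devM-N numbers.
    sar -d -p 5 12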