On 10/27/2011 02:27 PM, Scott Mead wrote:
On Thu, Oct 27, 2011 at 2:39 PM, Brian Fehrle <brianf@xxxxxxxxxxxxxxxxxxx> wrote:
Hi all, need some help/clues on tracking down a performance
issue.
PostgreSQL version: 8.3.11
I've got a system that has 32 cores and 128 gigs of ram. We
have connection pooling set up, with about 100 - 200
persistent connections open to the database. Our applications
then use these connections to query the database constantly,
but when a connection isn't currently executing a query, it's
<IDLE>. On average, at any given time, there are 3 - 6
connections that are actually executing a query, while the
rest are <IDLE>.
Remember, when you read pg_stat_activity, it is showing you
query activity from that exact moment in time. Just because it
looks like only 3-6 connections are executing doesn't mean that
200 aren't actually executing statements that finish in under
0.1 ms. With such a beefy box, I would see if you can examine
any stats from your connection pooler to find out how many
connections are actually getting used.
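Something like this, run every few seconds from a watch/cron loop,
would at least show how long the non-idle statements have actually
been running between your polls (just a sketch against 8.3's
pg_stat_activity):

    -- non-idle backends and how long their current statements have been running
    SELECT procpid, usename,
           now() - query_start AS runtime,
           waiting,
           current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
     ORDER BY query_start;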
Correct, we're getting a few hundred transactions per second, but
under normal operation polling pg_stat_activity shows an average of
3 - 6 queries running at that moment, and those queries run for an
average of 5 - 7 seconds. So my belief is that something happens to
the system where either a) we get a ton more queries than normal
from the application (I'm currently hunting down data to support
this), or b) the overall speed of the system slows down so much that
all queries take long enough for polling pg_stat_activity to
actually catch them.
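To help distinguish (a) from (b), I'm planning to log a count of busy
vs. idle backends every few seconds, roughly like this, so I have a
history to line up against the slowdown windows (just a sketch; 8.3
marks idle sessions in current_query):

    -- count busy vs. idle backends at this instant
    SELECT now() AS sample_time,
           sum(CASE WHEN current_query NOT LIKE '<IDLE>%' THEN 1 ELSE 0 END) AS active,
           sum(CASE WHEN current_query LIKE '<IDLE>%' THEN 1 ELSE 0 END) AS idle
      FROM pg_stat_activity;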
About once a day, queries that normally take just a few
seconds slow way down, and start to pile up, to the point
where instead of just having 3-6 queries running at any given
time, we get 100 - 200. The whole system comes to a crawl, and
looking at top, the CPU usage is 99%.
Looking at top, I see no SWAP usage, very little IOWait, and
there are a large number of postmaster processes at 100% cpu
usage (makes sense, at this point there are 150 or so queries
currently executing on the database).
Tasks: 713 total, 44 running, 668 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.4%us, 92.0%sy, 0.0%ni, 3.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.2%st
Mem: 134217728k total, 131229972k used, 2987756k free, 462444k buffers
Swap: 8388600k total, 296k used, 8388304k free, 119029580k cached
In the past, we noticed that autovacuum was hitting some large
tables at the same time this happened, so we turned autovacuum
off to see if that was the issue, and it still happened
without any vacuums running.
That was my next question :)
We also ruled out checkpoints being the cause.
How exactly did you rule this out? Just because a
checkpoint is over doesn't mean that it hasn't had a negative
effect on the OS cache. If you're stuck going to disk, that
could be hurting you (that being said, you do point to a low
I/O wait above, so you're probably correct in ruling this
out).
Checkpoint settings were at the install defaults: 5 minute timeout,
0.5 completion target, and 30s warning. Looking at the logs, we were
getting a checkpoint every 5 minutes on the dot.
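For reference, the relevant postgresql.conf entries were essentially
these (checkpoint logging turned on so checkpoints show up in the log;
values as I believe we had them):

    checkpoint_timeout = 5min            # default
    checkpoint_completion_target = 0.5   # default
    checkpoint_warning = 30s             # default
    log_checkpoints = on                 # logs each checkpoint start/finish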
I looked at the data in pg_stat_bgwriter and noticed that buffers
written by checkpoints were nearly 4X those written by the background
writer. So I made some changes to shift more of the writing onto the
background writer, including increasing checkpoint_timeout to 30
minutes and reducing bgwriter_delay from 200ms to 50ms.
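The comparison came from a query roughly like this (the ratio column
is just for eyeballing the 4X figure):

    -- buffers written at checkpoint time vs. by the background writer (8.3 columns)
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend,
           round(buffers_checkpoint::numeric / nullif(buffers_clean, 0), 1)
             AS checkpoint_to_bgwriter_ratio
      FROM pg_stat_bgwriter;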
Checkpoints now happen 30 minutes apart on the dot, and there was no
checkpoint in progress the last time this major slowdown occurred.
I'm currently digging through some statistics I've been
gathering to see whether traffic increased or stayed the same
when the slowdown occurred. I'm also digging through the logs
from the PostgreSQL cluster (I increased verbosity yesterday),
looking for any clues. Any suggestions on where to look to see
what could be causing a slowdown like this would be greatly
appreciated.
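For reference, the kind of logging settings I mean by "increased
verbosity" are along these lines (values are illustrative, not a copy
of our config):

    log_min_duration_statement = 1000   # log any statement taking over 1 second
    log_lock_waits = on                 # log waits longer than deadlock_timeout
    log_checkpoints = on
    log_temp_files = 0                  # log every temp file created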
Are you capturing table-level stats from
pg_stat_user_[tables | indexes]? Just because a server
doesn't look busy doesn't mean that you're not doing 1000
index scans per second returning 1000 tuples each time.
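Something along these lines, sampled periodically and diffed, would
show whether per-table scan volume jumps when the slowdown hits (just
a sketch against 8.3's pg_stat_user_tables):

    -- top tables by tuples read since the stats were last reset
    SELECT relname,
           seq_scan, seq_tup_read,
           idx_scan, idx_tup_fetch,
           n_tup_ins + n_tup_upd + n_tup_del AS writes
      FROM pg_stat_user_tables
     ORDER BY seq_tup_read + idx_tup_fetch DESC
     LIMIT 20;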
I am not grabbing any of those at the moment, I'll look into those.
- Brian F