Occasional spike in query response time (jumps from 200ms to 45 seconds)

Hi list,

From time to time (every one to two months), one of the production systems I
manage starts misbehaving, and this is becoming a problem.

Every query I send to the server takes a very long time (sometimes 45+
seconds). Even the simplest queries, like "SELECT NOW()", run in 4-5
seconds... or more.

This goes on for a few minutes. After that, everything is back to normal.
I must say that this is not a big system -- two databases: one with a few
hundred tables but not much data; the other with only a few tables but around
15GB of data (the tables are partitioned by month). No heavy load. Most
queries normally run in under 20ms; only a few take longer, and those peak at
around 200ms.

The servers are HP ProLiant DL360 G7 machines with 8GB of RAM and 1GB of
cache on the RAID controller (HP Smart Array G6 controller). The system uses
two disks in a mirror; there is no dedicated disk for the database.
I am running CentOS 6.3 (kernel 2.6.32-279.9.1.el6.i686).
PostgreSQL 9.1.6 was compiled from source.

I'm guessing that, somehow, there is heavy I/O usage at those times (and I am
starting to suspect the autovacuum daemon).
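To test that suspicion, this is the kind of check I plan to run the next time
it happens (note that on 9.1, pg_stat_activity still exposes procpid and
current_query rather than pid and query):

```sql
-- Are any autovacuum workers running right now, and on which tables?
SELECT procpid, datname, query_start, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum:%';

-- When was each table in the 15GB database last autovacuumed?
SELECT relname, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST;
```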
I have tried monitoring the server with iotop and iostat, but I couldn't find
anything useful.
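In case it helps, I'm also considering enabling the following logging in
postgresql.conf so the next episode at least leaves a trace (the values below
are just a starting point, not my current settings):

```
# postgresql.conf -- make autovacuum and slow statements visible in the log
log_autovacuum_min_duration = 0     # log every autovacuum action
log_min_duration_statement = 1000   # log any statement slower than 1s (in ms)
log_checkpoints = on                # checkpoint timing, since I/O spikes are suspected
```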
Yes, the database server is using a lot of I/O, especially the clients that
connect to the second (15GB) database, but that was to be expected.
Also, the other processes running on the server (there are a few other
clients) don't use much I/O bandwidth; they use the CPU more, but not enough
to stall the whole system, so I have ruled them out.

At this moment an upgrade is not desired, but if there are hints that an
upgrade would solve this issue, then I'll insist on it.
I've seen that the PostgreSQL 9.1.8 changelog contains some bug fixes related
to autovacuum performance:

// BEGIN QUOTE
Fix performance problems with autovacuum truncation in busy workloads (Jan 
Wieck)

Truncation of empty pages at the end of a table requires exclusive lock, but 
autovacuum was coded to fail (and release the table lock) when there are 
conflicting lock requests. Under load, it is easily possible that truncation 
would never occur, resulting in table bloat. Fix by performing a partial 
truncation, releasing the lock, then attempting to re-acquire the lock and 
continue. This fix also greatly reduces the average time before autovacuum 
releases the lock after a conflicting request arrives.

Fix error in vacuum_freeze_table_age implementation (Andres Freund)

In installations that have existed for more than vacuum_freeze_min_age 
transactions, this mistake prevented autovacuum from using partial-table 
scans, so that a full-table scan would always happen instead.
// END QUOTE
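If the truncation bug quoted above is what I'm hitting, I'd expect the bloat
to show up as dead tuples that never go away. This is the query I've been
thinking of using to check (the 10000 threshold is arbitrary, just for
illustration):

```sql
-- Tables with many dead rows relative to live rows (possible bloat)
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
```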

Could you give me some hints on what to look for?
What might be causing this behavior?
Any idea is greatly appreciated!

Thank you!
-- 
Robert Voinea
Software Engineer
+4 0740 467 262

Don't take life too seriously. You'll never get out of it alive.
(Elbert Hubbard)


-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin
