Sometimes I see a strange slowdown on my PG 9.1 server: it looks like the simplest queries which typically take 1ms or less (e.g. selection of a row by its primary key) take 300ms or even more. It is related to all queries within the connection, not the single one: once upon a time all fast queries start to work 300 times slower. In a couple of minutes the effect disappears. The machine is not overloaded, number of connections is not too high, mtr shows no packet loss.
E.g. see the following log from a particular website page I've been refreshing (times are in milliseconds, you may see that typically all 22 queries take near 55ms, but within 3 minutes - from 20:21 till 20:23 approx - even simplest queries took more than 300ms instead of 1ms, so the total time is many seconds).
web-01: [2012-11-20 20:20:57] web07725] total=250 cpu=210 php=203 dbq=22 db=55ms
web-01: [2012-11-20 20:21:06] web07752] total=6254 cpu=220 php=219 dbq=22 db=6035ms <--
web-01: [2012-11-20 20:21:48] web07718] total=7506 cpu=220 php=222 dbq=22 db=7284ms <--
web-01: [2012-11-20 20:22:15] web07717] total=11143 cpu=224 php=225 dbq=22 db=10918ms <--
web-01: [2012-11-20 20:23:44] web07725] total=4402 cpu=224 php=220 dbq=22 db=4182ms <--
web-01: [2012-11-20 20:24:03] web07725] total=268 cpu=212 php=211 dbq=22 db=57ms
The machine is not overloaded during all this period (I've been logging "top" and "pstree" each 10 seconds, results are quite typical):
Nov 20 20:23:07 pg-m03 mon: [Tue Nov 20 20:23:07 MSK 2012]
Nov 20 20:23:07 pg-m03 mon: init-+-crond
Nov 20 20:23:07 pg-m03 mon: |-master-+-pickup
Nov 20 20:23:07 pg-m03 mon: | `-qmgr
Nov 20 20:23:07 pg-m03 mon: |-postmaster---32*[postmaster]
Nov 20 20:23:07 pg-m03 mon: |-screen---sh---su---bash---monitoring.sh-+-logger
Nov 20 20:23:07 pg-m03 mon: | `-monitoring.sh---pstree
Nov 20 20:23:07 pg-m03 mon: |-sshd---sshd---perl
Nov 20 20:23:07 pg-m03 mon: |-syslog-ng
Nov 20 20:23:07 pg-m03 mon: |-udevd
Nov 20 20:23:07 pg-m03 mon: |-vzctl---bash---mc---bash
Nov 20 20:23:07 pg-m03 mon: `-xinetd
Nov 20 20:23:07 pg-m03 mon:
Nov 20 20:23:08 pg-m03 mon: top - 20:23:08 up 113 days, 3:29, 0 users, load average: 2.16, 2.10, 2.00
Nov 20 20:23:08 pg-m03 mon: Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Nov 20 20:23:08 pg-m03 mon: Cpu(s): 23.8%us, 2.8%sy, 0.0%ni, 72.7%id, 0.8%wa, 0.0%hi, 0.0%si, 0.0%st
Nov 20 20:23:08 pg-m03 mon: Mem: 1463616k total, 1097776k used, 365840k free, 0k buffers
Also I've been logging all queries which are active in the database each 10 seconds. During the slowdown the database was not overloaded, no strange queries, except one thing: during the whole slowdown period two pg processes was in "<IDLE> in transaction" state, and in 3 minutes (see 3rd column - it's a "time from start of the query") these processes disappeared, and the slowdown disappeared too:
| duration | query
Nov 20 20:22:05 pg-m03 mon: 21586 | 00:00:37.571057 | <IDLE> in transaction
Nov 20 20:22:05 pg-m03 mon: 21567 | 00:00:37.557606 | <IDLE> in transaction
Nov 20 20:22:36 pg-m03 mon: 21586 | 00:01:09.119639 | <IDLE> in transaction
Nov 20 20:22:36 pg-m03 mon: 21567 | 00:01:09.106188 | <IDLE> in transaction
Nov 20 20:22:47 pg-m03 mon: 21586 | 00:01:19.637888 | <IDLE> in transaction
Nov 20 20:22:47 pg-m03 mon: 21567 | 00:01:19.624437 | <IDLE> in transaction
Nov 20 20:22:57 pg-m03 mon: 21586 | 00:01:30.155492 | <IDLE> in transaction
Nov 20 20:22:57 pg-m03 mon: 21567 | 00:01:30.142041 | <IDLE> in transaction
Nov 20 20:24:21 pg-m03 mon: 21586 | 00:02:54.294799 | <IDLE> in transaction
Nov 20 20:24:21 pg-m03 mon: 21567 | 00:02:54.281348 | <IDLE> in transaction
Nov 20 20:24:31 - no more "<IDLE> in transaction" (seems they disappeared after 3 minutes timeout?)
Maybe these are correlated? But how could it be that two "idle" processes cause slowdown of other connections (note: not "timeout", but exactly - slowdown from 1ms to 300ms)?
What could be the problem of all these things? Such slowdown typically happens 2-3 times per day, no direct correlation with checkpoints, disks loads or something else. It just starts to work slow even on simplest queries, and then it restores normal operations.