Hi everyone!
I've been working on a puzzling issue for a few days am am hoping that someone has seen something similar or can help. There have been some odd behaviors on one of my production facing postgres servers.
version info from postgres: PostgreSQL 9.1.9 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit
The symptom: The database machine (running postgres 9.1.9 on CentOS 6.4) is running a low utilization most of the time, but once every day or two, it will appear to slow down to the point where queries back up and clients are unable to connect. Once this event occurs, there are lots of concurrent queries, I see slow queries appear in the logs, but there doesn't appear to be anything abnormal that I have been able to see that causes this behavior. The event will occur just long enough for monitoring to alarm. We will respond to alerts to take a look, but within a minute or three at most, load returns back to normal levels and all running queries complete in expected times.
At the time of the event, we see a spike in system CPU and load average, but we do not see a corresponding spike in disk reads or writes which would indicate IO load. Initial troubleshooting to monitor active processes led us to see a flurry of activity in ps waiting on semtimedop. Our efforts internally to diagnose this problem are to sample pg_locks and pg_stat_activity every 5s plus running a script to look for at least one postgres process waiting on a semaphore, and if it finds one, it gets a stack trace of every running postgres processes with GDB. It also uses strace on 5 processes to find out which semaphore they're waiting on.
What we were catching in the following stack trace seems to be representative of where things are waiting when we see an event - here are two examples that are representative - lots of threads will appear to be in this state:
----- 47245 -----
Has any on the forum seen something similar? Any suggestions on what to look at next? If it is helpful to describe the server hardware, it's got 2 E5-2670 cpu and 256 GB of ram, and the database is hosted on 1.6TB raid 10 local storage (15K 300 GB drives). The workload is predominantly read and the queries are mostly fairly simple selects from a single large table generally specifying the primary key as part of the where clause along with a few other filters.
Thanks, Matt |