On Thu, Jun 19, 2014 at 2:35 PM, Kevin Grittner <kgrittn@xxxxxxxxx> wrote:
> "Vasudevan, Ramya" <ramya.vasudevan@xxxxxxxxxxxxxx> wrote:
>
>> On the waiting queries - When we reached 1500 connections, we
>> had 759 connections that were in active state (116 COMMIT, 238
>> INSERT, 176 UPDATE, 57 AUTHENTICATION, 133 BIND). These active
>> INSERTs and UPDATEs also include the 80-90 waiting sessions (we
>> checked pg_stat_activity for 'waiting' state, and pg_locks for
>> granted = false). The blocking and the waiting queries were
>> simple one-row updates/inserts/deletes. These shouldn't be
>> blocking each other normally (unless, we think, there was a
>> problem writing to the disk). Correct me if I am wrong.
>
> You may want to consider this:
>
> http://wiki.postgresql.org/wiki/Number_Of_Database_Connections
>
>> [vmstat shows up to 173111 context switches per second, with high
>> cs rate corresponding to user CPU% between 64 and 82]
>
> That usually means there is high contention for spinlocks,
> potentially with processes getting suspended while holding
> spinlocks, making things worse.

Completely agree on both diagnosis and proposed solution -- the load
profile (low iowait, high user%, high process count, high CS) is
symptomatic of too many processes trying to do things at once.  So
there may be some application-caused driver of the problem, or you
are hitting a contention point within postgres (a 'perf top' might
give clues to the latter).  Either way, once you are in this state
you end up with too many processes fighting for cpu and cache lines,
which exacerbates the problem to the point you can classify it as an
outage.

Be advised that transaction-mode pooling makes certain features of
the database difficult or impossible to use -- advisory locks (except
the xact variants), server-side prepared statements, asynchronous
notification, WITH HOLD cursors, and the like -- basically anything
scoped to the session.  For many workloads it is a big win, though.

If for whatever reason this solution doesn't work, your other options
are to try to optimize whatever is causing the load event
(particularly if it's in your code -- careful query logging might
give some clues; a sketch is appended below) or to simply upgrade
hardware (more/faster cpu, especially in your case) to the point that
even when highly loaded you're always clearing queries at an
acceptable rate.  The hardware approach has some risk though -- if
you have a contention problem, it's not always a given that adding
cores will scale as well as you think.  A faster core/bus is almost
always a win, but obviously there's a strict limit to how far you can
go.

merlin
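
For reference, a rough sketch (not from the original thread) of how a
state breakdown like the one quoted above -- 759 active, 116 COMMIT,
238 INSERT, and so on -- can be pulled from pg_stat_activity.  It
assumes PostgreSQL 9.2+, where the view has the state and query
columns plus the pre-9.6 boolean waiting column; the query_tag
expression just takes the first keyword of the query text:

    -- Sketch only: group sessions by state, waiting flag, and the
    -- leading keyword of the current query (INSERT, UPDATE, ...).
    SELECT state,
           waiting,
           upper(split_part(btrim(query), ' ', 1)) AS query_tag,
           count(*)
    FROM pg_stat_activity
    GROUP BY 1, 2, 3
    ORDER BY 4 DESC;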
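
Similarly, a simplified sketch of pairing waiting sessions (granted =
false in pg_locks) with the sessions holding the conflicting locks.
This is a cut-down variant of the lock-monitoring queries on the
PostgreSQL wiki, again assuming 9.2+ column names (pid, query); a
production version would match every lock-identity column
(virtualxid, classid, objid, objsubid, etc.), not just the ones shown
here:

    -- Sketch only: for each ungranted lock, find another session
    -- holding a granted lock on the same object, and show both
    -- sessions' current queries.
    SELECT w.pid    AS waiting_pid,
           wa.query AS waiting_query,
           h.pid    AS holding_pid,
           ha.query AS holding_query
    FROM pg_locks w
    JOIN pg_locks h
      ON  h.locktype = w.locktype
      AND h.database      IS NOT DISTINCT FROM w.database
      AND h.relation      IS NOT DISTINCT FROM w.relation
      AND h.page          IS NOT DISTINCT FROM w.page
      AND h.tuple         IS NOT DISTINCT FROM w.tuple
      AND h.transactionid IS NOT DISTINCT FROM w.transactionid
      AND h.pid <> w.pid
    JOIN pg_stat_activity wa ON wa.pid = w.pid
    JOIN pg_stat_activity ha ON ha.pid = h.pid
    WHERE NOT w.granted
      AND h.granted;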
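
Finally, a hypothetical example of the "careful query logging"
mentioned above, expressed as postgresql.conf settings (the threshold
values are illustrative only, not recommendations; a config reload is
enough to apply them):

    # Sketch only: log slow statements and lock waits to find what
    # drives the load event.
    log_min_duration_statement = 100   # log statements taking > 100 ms
    log_lock_waits = on                # log waits longer than deadlock_timeout
    log_line_prefix = '%t [%p] user=%u db=%d '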