Sudden slowdown of Pg server

Jerry Sievers <jerry@xxxxxxxxxxxxxxxx> · 20 Jan 2006 16:42:20 -0500

Hello;  

I am going through a post mortem analysis of an infrequent but
recurring problem on a Pg 8.0.3 installation.  Application code
connects to Pg using J2EE pooled connections.

 PostgreSQL 8.0.3 on sparc-sun-solaris2.9, compiled by GCC sparc-sun-solaris2.8-gcc (GCC) 3.3.2

Database is quite large with respect to the number of tables, some of
which have up to 6 million tuples.  Typical idle/busy connection ratio
is 3/100 but occationally we'll catch 20 or more busy sessions.

The problem manifests itself and appears like a locking issue.  About
weekly throuput slows down and we notice the busy connection count
rising minute by minute.  2, 20, 40...  Before long, the app server
detects lack of responsiveness and fails over to another app server
(not Pg) which in turn attempts a bunch of new connections into
Postgres.

Sampling of the snapshots of pg_locks and pg_stat_activity tables
takes place each minute.

I am wishing for a few new ideas as to what to be watching; Here's
some observations that I've made.

1. At no time do any UN-granted locks show in pg_locks
2. The number of exclusive locks is small 1, 4, 8
3. Other locks type/mode are numerous but appear like normal workload.
4. There are at   least a few old '<IDLE> In Transaction' cases in
   activity view
5. No interesting error messages or warning in Pg logs.
6. No crash of Pg backend

Other goodies includes a bounty of poor performing queries which are
constantly being optimized now for good measure.  Aside from the heavy
queries, performance is generallly decent.

Resource related server configs have been boosted substantially but
have not undergone any formal R&D to verify that we're inthe safe
under heavy load.

An max_fsm_relations setting which is *below* our table and index
count was discovered by me today and will be increased this evening
during a maint cycle.

The slowdown and subsequent run-away app server takes place within a
small 2-5 minute window and I have as of yet not been able to get into
Psql during the event for a hands-on look.

Questions;

1. Is there any type of resource lock that can unconditionally block
   another session and NOT appear as UN-granted lock?

2. What in particular other runtime info would be most useful to
   sample here?

3. What Solaris side runtime stats might give some clues here
   (maybe?)( and how often to sample?  Assume needs to be aggressive
   due to how fast this problem crops up.

Any help appreciated

Thank you

-- 
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile	http://www.JerrySievers.com/