Re: Sudden slowdown of Pg server

"Jignesh K. Shah" <J.K.Shah@xxxxxxx> · Fri, 20 Jan 2006 17:13:58 -0500

lockstat is available in Solaris 9. That can help you to determine if 
there are any kernel level locks that are occuring during that time.
Solaris 10 also has plockstat which can be used to identify userland 
locks happening in your process.

Since you have Solaris 9, try the following:

You can run (as root)
lockstat sleep 5 
and note the output which can be long.

I guess "prstat -am" output, "iostat -xczn 3", "vmstat 3" outputs will 
help also.

prstat -am has a column called "LAT", if the value is in double digits, 
then you have a locking issue which will probably result in higher "SLP" 
value for the process.  (Interpretation is data and workload specific 
which this email is too small to decode)

Once you have identified a particular process (if any) to be the source 
of the problem, get its id and you can look at the outputs of following 
command which (quite intrusive)
truss -c -p $pid   2> truss-syscount.txt

(Ctrl-C after a while to stop collecting)

truss -a -e -u":::" -p $pid 2> trussout.txt

(Ctrl-C after a while to stop collecting)

Regards,
Jignesh

Jerry Sievers wrote:

Hello;  

I am going through a post mortem analysis of an infrequent but
recurring problem on a Pg 8.0.3 installation.  Application code
connects to Pg using J2EE pooled connections.

PostgreSQL 8.0.3 on sparc-sun-solaris2.9, compiled by GCC sparc-sun-solaris2.8-gcc (GCC) 3.3.2

Database is quite large with respect to the number of tables, some of
which have up to 6 million tuples.  Typical idle/busy connection ratio
is 3/100 but occationally we'll catch 20 or more busy sessions.

The problem manifests itself and appears like a locking issue.  About
weekly throuput slows down and we notice the busy connection count
rising minute by minute.  2, 20, 40...  Before long, the app server
detects lack of responsiveness and fails over to another app server
(not Pg) which in turn attempts a bunch of new connections into
Postgres.

Sampling of the snapshots of pg_locks and pg_stat_activity tables
takes place each minute.

I am wishing for a few new ideas as to what to be watching; Here's
some observations that I've made.

1. At no time do any UN-granted locks show in pg_locks
2. The number of exclusive locks is small 1, 4, 8
3. Other locks type/mode are numerous but appear like normal workload.
4. There are at   least a few old '<IDLE> In Transaction' cases in
  activity view
5. No interesting error messages or warning in Pg logs.
6. No crash of Pg backend

Other goodies includes a bounty of poor performing queries which are
constantly being optimized now for good measure.  Aside from the heavy
queries, performance is generallly decent.

Resource related server configs have been boosted substantially but
have not undergone any formal R&D to verify that we're inthe safe
under heavy load.

An max_fsm_relations setting which is *below* our table and index
count was discovered by me today and will be increased this evening
during a maint cycle.

The slowdown and subsequent run-away app server takes place within a
small 2-5 minute window and I have as of yet not been able to get into
Psql during the event for a hands-on look.

Questions;

1. Is there any type of resource lock that can unconditionally block
  another session and NOT appear as UN-granted lock?

2. What in particular other runtime info would be most useful to
  sample here?

3. What Solaris side runtime stats might give some clues here
  (maybe?)( and how often to sample?  Assume needs to be aggressive
  due to how fast this problem crops up.

Any help appreciated

Thank you