Re: postgress 8.1.4 deadlocking??

Rafal Pietrak <rafal@xxxxxxxxxxxxxxxxxx> · Wed, 27 Sep 2006 22:00:13 +0200

Ups, missed the list recepient itself.

Thenx Duncan for the analysis.

This happend again, so I'm able to peek at the details you've pointed
out.

On Wed, 2006-09-27 at 09:33 -0700, Casey Duncan wrote:
> Sounds like it was blocked (unsure by what). You can use pg_locks to  
> check that.

That view currently contais 240 entries .... source of blocking. Now I
have to dig the ultimate reason why those locks end up there (and are
not being cleand). Any ideas?

Whould those locks go away by themselves .... in time?

> What was consuming all the CPU? I assume from what you wrote  
> previously that it is a postgres backend. If so, what was it doing?  

----------------ps output (just one backend)------------
postgres  4759  0.0  2.0  19368  5292 ?        S    21:02   0:00
postgres: czyn_738773 czyn 127.0.0.1(55431) UPDATE waiting  
-------------------------------------------------------

"czyn" is the name of the database, "czyn_738773" is database username.
IP address is localhost, since the database purpose is web serbvice
back-end.

Still, I'm not sure what that UPDATE counld that be.

> You can use ps, the pg_stat_activity view and strace to figure that out.

------------- exerpt from pg_stat_activity-----------------
 17660 | czyn |    4759 |    19541 | czyn_738773 | <command string not
enabled> |             | 2006-09-27 21:02:05.914728+02 | 127.0.0.1   |
55431
(18 rows)
------------------------------------------

But this is not very meaningfull to me.

> > ...another postgress-process restart, and another VACCUM ANALYSE, this
> > one completes in 3-5 secs. Now "ALTER GROUP ..." goes OK. And  
> > everything
> > seams working.
> 
> Was something else using the database previously? Was something else  
> different this time than the last?

No. My impression was (naive, I know), that not all things were
correctly cleand-up on the first restart. Now I think, that on the
second go I was just a little bit more lucky - www service was calm for
a little longer. That was anough for VACUUM to complete.

Still, throuble started later.

> How big is the data on disk? Is it possible that you have queries  

close to nothing: c.a. 200MiB

> problem (hardware, os, fs). When you saw this happen did you happen  
> to check if it was user or system cpu usage? I had a corrupt  

It's the user CPU time. 

As it currently looks like some unknown UPDATE causes the trouble, how
can I identify what it is?
1. the database is not very sophisticated, apart from what I've
mentioned earlier, it does have just a few stored procedures in pgsql -
nothing fancy. And there is no triggers just specialized views (although
quite a lot of referencial integrity constrains).
2. machine is 'from the old days': 450MHz CPU
3. the system is currently under 'available-for-public' condition, so
there may be 'waves' (peeks) of activity.
4. when the system was tested in-lab (small number of users - no risk /
not-tested for havier load behavior), I haven't noticed this 'blocking'.

Any ideas where should I look now, to pinpoint the problem?
-- 
-R