> Well, I've had time to read your previous message too.
>
> The first time you seem to imply the machine slowed down across all
> processes - ssh etc. Was that the case this time?

OK, the slowdown/crash was a different problem, which might have been
caused by a "too many files open" condition combined with a huge load.
There was no evidence that the server actually crashed, but we needed
to reboot it because it became completely unresponsive.

My current problem is this one:
http://archives.postgresql.org/pgsql-general/2005-11/msg00828.php

The current situation is different: the machine was almost idle, there
was one UPDATE running on a table, a lot of INSERTs on the same table
waiting for a lock held by the UPDATE's transaction, and a VACUUM
(started by autovacuum) running on the same table. No deadlock was
detected by the system, and I guess it would have detected one if there
was, because the whole thing had been running for more than 3 hours
according to pg_stat_activity, and I have the deadlock timeout set to
2 seconds.

That was 2 days ago, and today it happened again with exactly the same
picture: update, inserts, (auto)vacuum, on exactly the same table as
2 days ago. That can't be a coincidence; something is wrong either with
our code or with postgres. I do suspect our code is doing something not
completely kosher, but it still should not lock up in this way.

Now, according to pg_locks, the transaction of the UPDATE was holding
an exclusive lock on the table, which I can't explain, as we don't
acquire any table lock anywhere in our application, and the SQL of the
UPDATE comes from our application - I could identify the process it is
part of. The only explicit locking we do in that process is via a
SELECT ... FOR UPDATE on that table, but that should place no exclusive
lock on the table, right? So where is that lock coming from? And then,
why was I not able to cancel the backend via kill or pg_cancel_backend?

> When you say "locked" do you mean it was waiting on locks, was using
> all the CPU, unresponsive or just taking the query a long time?
>
> To prepare for next time I'd:
> 1. Leave ssh logged-in, run screen to get three sessions
> 2. Leave "top" running in the first - that'll show you process
>    activity/general load
> 3. Run "vmstat 10" in the second - that'll show you overall
>    memory/swap/disk/cpu usage.
> 4. The third session is then free to work in, if neither of the first
>    two show anything useful.

In that particular case I had a few terminals open, one of them
actually running top and the others free. They were not responding
either, so the machine was completely bogged down. I do routinely have
a few terminals open, a psql one included. But that kind of lock-down
has not happened again, and we changed servers in the meantime.

Cheers,
Csaba.
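
P.S. In case it helps anyone look at a similar situation: below is a
sketch of the kind of query one can run against pg_locks and
pg_stat_activity to see which backend holds which lock on a table and
which backends are waiting for it, plus how to ask a backend to cancel
its current statement. Column names vary between versions (older
releases use procpid/current_query instead of pid/query), 'the_table'
is just a placeholder name, and 12345 stands for a pid taken from the
first query - adjust to your setup.

    -- List granted and waiting locks on one table, together with the
    -- query and start time of each backend involved.
    SELECT l.pid, l.mode, l.granted, a.query_start, a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
     WHERE l.relation = 'the_table'::regclass
     ORDER BY l.granted DESC, a.query_start;

    -- Ask one backend to cancel its current statement (superuser only).
    SELECT pg_cancel_backend(12345);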