Re: backend hangs at sendto() and can't be terminated

Steven Schlansker <steven@xxxxxxxxxxxx> · Thu, 11 Jul 2013 17:22:58 -0700

On Jul 8, 2013, at 6:48 AM, Jov <amutu@xxxxxxxxx> wrote:

> netstat show nothing about the socket of the process,so I think the TCP timeout took effect.so it is really wired.
> 
> Jov
> blog: http:amutu.com/blog
> 
> 
> 2013/7/8 Tom Lane <tgl@xxxxxxxxxxxxx>
> Merlin Moncure <mmoncure@xxxxxxxxx> writes:
> > On Mon, Jul 8, 2013 at 4:56 AM, Jov <amutu@xxxxxxxxx> wrote:
> >> my first post already try the pg_terminate_backend but failed:
> >> pg_terminate_backend return t but the backend still there.
> 
> > possibly a kernel problem?
> 
> The backend will keep trying to send data until the kernel informs it
> the connection is lost.  (Anything else would be a bad idea.)  So the
> real question here is why it's taking so long for the TCP stack to
> decide that the client is gone.  I'm wondering what exactly you did
> to kill the psql session.  Most ordinary ways of killing a process
> should result in closure of whatever connections it had open.
> 
> If you'd lost network connectivity to the client, a TCP timeout on the
> order of an hour wouldn't be surprising.  (If you feel this is too long,
> you can fool with the TCP keepalive parameters.)  But it seems unlikely
> that that's what's happening here.

Interestingly enough, I am seeing what may (or then again, may not) be a related problem.

I have a backend process stuck in a "recvfrom" --
[root@prd-db2a ~]# strace -tv -p 24402
Process 24402 attached - interrupt to quit
00:02:00 recvfrom(10, 

postgres 24402  0.0 10.7 3505628 2639032 ?     Ss   Jul01   0:21 postgres: event event 10.29.62.21(39485) idle

It is a psql process that I launched from the command line 10 days ago:

steven   24401  0.0  0.0 166824  2532 pts/2    T    Jul01   0:00 psql -U event -h prd-db2a.nessops.net -c delete from event where event_id in (select event_id from event where payload is null limit 100000);

event=# select pid,application_name,backend_start,waiting,state,query from pg_stat_activity where pid=24402;
  pid  | application_name |         backend_start         | waiting | state |                                                query                                   

-------+------------------+-------------------------------+---------+-------+----------------------------------------------------------------------------------------
--------------
 24402 | psql             | 2013-07-01 21:03:27.417039+00 | f       | idle  | delete from event where event_id in (select event_id from event where payload is null l
imit 100000);
(1 row)

I invoked it with -c, which supposedly makes it exit when the single command is finished.  Many similar queries have been run, and I'd say they run for a half hour on average.

But the process has been alive for 10 days now, not blocked, but just idle.  It is connected over TCP from the local box (although not through the loopback interface, through a 10.x interface)  How does this make any sense?

The command seems to be immune to pg_cancel_backend, but pg_terminate_backend did manage to kill it.

If this problem is not related, please tell me to shove off, and I will not pollute this thread further.  But I hope some of this information is useful.

I am running PG 9.2.4, CentOS kernel 2.6.32.360.

Best,
Steven

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general