Re: Clients disconnect but query still runs

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Thu, 30 Jul 2009 19:29:54 +0800

Csaba Nagy wrote:
> On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
>> I know this is a popular feeling. But you're throwing away decades of
>> work in making TCP reliable. You would change feelings quickly if you
>> ever faced this scenario too. All it takes is some bad memory or a bad
>> wire and you would be turning a performance drain into random
>> connection drops.
>
> But if I get bad memory or bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than having spurious errors which you'll get even
> if you don't kill the connection in case of such problems.

Transient connection issues are not infrequent, and shouldn't promptly 
kill connections.

A user's wifi might drop and then re-establish service. They might bump 
the Ethernet cable out (and it's inevitably lost its retaining clip). A 
router _somewhere_ along the route might reboot. Etc.

That said, TCP keepalives are designed to allow for this, and only 
consider the connection dead if it's failed to respond for a reasonable 
period and hasn't acknowledged several requests.

> Well it lived for at least one hour (could be more, I don't remember for
> sure) keeping vacuum from doing its job on a heavily updated DB.

Unless you've changed the defaults, TCP keepalives will take several 
hours to notice a dead connection - if they're enabled at all.
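Roughly, a silent peer is only declared dead after the idle time plus one probe interval for each unanswered probe. A minimal sketch of that arithmetic, assuming the usual Linux sysctl defaults (net.ipv4.tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9); other operating systems use different values, and the function name is just for illustration:

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate worst-case seconds before the kernel gives up on a silent peer.

    idle:     seconds without traffic before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes tolerated before the connection is dropped
    """
    return idle + interval * count


# Assumed Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
linux_default = keepalive_detection_seconds(7200, 75, 9)
print(f"{linux_default} s (~{linux_default / 3600:.1f} h)")  # prints "7875 s (~2.2 h)"
```

Dropping the three knobs to, say, 60/10/5 would bring the same calculation down to 110 seconds.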

> It was
> not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning, except of course I did find out immediately that bloat
> happens and found the idle transactions

Idle? I thought your issue was _active_ queries running, servicing 
requests from clients that'd since ceased to care?

How did you manage to kill the client in such a way that the OS on
the client didn't send a FIN to the server anyway? Hard-reset the client
machine(s)?

> and killed them, but I can imagine
> the hair-pulling for a less experienced postgres DBA. I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.

It's not really meant to happen in the first place. I do think that if 
you have a lot of connections from unreliable machines (say hosts with 
intermittent connectivity) then you'd want to make sure tcp keepalives 
are active and that you've tuned the keepalive params to be much more 
aggressive.
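On the client side, per-socket keepalive tuning might look something like the sketch below. It assumes Python's socket module on a platform (such as Linux) that exposes TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT; the helper name and the 60/10/5 values are illustrative, not recommendations. On the server side, PostgreSQL's tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings cover the same ground.

```python
import socket


def enable_aggressive_keepalive(
    sock: socket.socket, idle: int = 60, interval: int = 10, count: int = 5
) -> None:
    """Turn on TCP keepalive and shorten its timers (illustrative helper).

    Options the current platform does not expose are skipped, so on such
    systems only SO_KEEPALIVE itself is changed.
    """
    # Keepalive probes are off unless the application asks for them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide timers, where supported.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

Call it on the socket before (or right after) connecting; with these values a vanished peer would be noticed after roughly 60 + 10 * 5 = 110 seconds of silence instead of hours.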

I thought your issue was the backend not terminating a query when the 
client died while the backend was in the middle of a long-running query. 
Keepalives alone won't solve that one.
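Keepalives only let the kernel learn that the peer is gone; a process busy computing a result still won't find out until it next looks at the socket. One way a server loop could check between chunks of work is sketched below: poll the socket without blocking and peek, treating an orderly EOF (the client's FIN) or a reset as a disconnect. This is a generic illustration with a hypothetical helper name, not how the 8.4 backend behaves; much later, PostgreSQL 14 added a client_connection_check_interval setting for this kind of check.

```python
import select
import socket


def client_has_disconnected(sock: socket.socket) -> bool:
    """Return True if the peer has closed or reset the connection (illustrative helper).

    Non-blocking: select() with a zero timeout, then MSG_PEEK so any
    pending request bytes stay queued for the normal read path.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        # No data and no EOF yet. A peer that vanished silently also looks
        # like this, which is where keepalives come in.
        return False
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except (ConnectionResetError, ConnectionAbortedError):
        return True
    return data == b""  # empty read means the peer shut down (FIN received)
```

A long-running loop would call this every N rows or every few seconds and abandon the work when it returns True.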

--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)