Csaba Nagy wrote:
> On Thu, 2009-07-30 at 11:41 +0200, Greg Stark wrote:
>> I know this is a popular feeling. But you're throwing away decades of
>> work in making TCP reliable. You would change feelings quickly if you
>> ever faced this scenario too. All it takes is some bad memory or a bad
>> wire and you would be turning a performance drain into random
>> connection drops.
> But if I get bad memory or a bad wire I'll get much worse problems
> already, and don't tell me it will work more reliably if you don't kill
> the connection. It's a lot better to find out sooner that you have those
> problems and fix them than having spurious errors which you'll get even
> if you don't kill the connection in case of such problems.
Transient connection issues are not infrequent, and shouldn't promptly
kill connections.
A user's wifi might drop and then re-establish service. They might bump
the Ethernet cable out (and it's inevitably lost its retaining clip). A
router _somewhere_ along the route might reboot. Etc.
That said, TCP keepalives are designed to allow for this, and only
consider the connection dead if it's failed to respond for a reasonable
period and hasn't acknowledged several requests.
> Well it lived for at least one hour (could be more, I don't remember for
> sure) keeping vacuum from doing its job on a heavily updated DB.
Unless you've changed the defaults, TCP keepalives will take a couple of
hours (over two hours with the Linux defaults) to notice a dead
connection - if they're enabled at all.
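To put a rough number on that: a silent peer is declared dead after about idle + interval × count seconds. The values below are the Linux kernel defaults (net.ipv4.tcp_keepalive_time = 7200, tcp_keepalive_intvl = 75, tcp_keepalive_probes = 9); other operating systems use different defaults, and the "aggressive" figures are purely hypothetical, so treat this as a back-of-the-envelope sketch only.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time for TCP keepalives to declare a silent peer dead.

    After `idle` seconds with no traffic the first probe is sent; the
    connection is dropped once `count` probes, sent `interval` seconds
    apart, have all gone unanswered.
    """
    return idle + interval * count


def as_hms(seconds: int) -> str:
    """Format a number of seconds as 'Xh Ym Zs'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


# Linux kernel defaults: tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes
print(as_hms(keepalive_detection_seconds(idle=7200, interval=75, count=9)))
# -> 2h 11m 15s

# Hypothetical aggressive values for hosts with flaky connectivity
print(as_hms(keepalive_detection_seconds(idle=60, interval=10, count=5)))
# -> 0h 1m 50s
```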
> It was
> not so much about my patience as about starting to have abysmal
> performance, AFTER we fixed the initial cause of the crash, and without
> any warning, except of course I did find out immediately that bloat
> happens and found the idle transactions
Idle? I thought your issue was _active_ queries running, servicing
requests from clients that'd since ceased to care?
How did you manage to kill the client in such a way that the OS on
the client didn't send a FIN to the server anyway? Hard-reset the client
machine(s)?
> and killed them, but I imagine
> the hair-pulling for a less experienced postgres DBA. I would have also
> preferred that postgres solve this issue on its own - the network
> stack is clearly not fast enough in resolving it.
It's not really meant to happen in the first place. I do think that if
you have a lot of connections from unreliable machines (say hosts with
intermittent connectivity) then you'd want to make sure tcp keepalives
are active and that you've tuned the keepalive params to be much more
aggressive.
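On the server side PostgreSQL exposes these knobs as the tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count settings. At the socket level the same thing looks roughly like the Python sketch below. It assumes Linux-style option names (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT), which Python only exposes on platforms that support them, and the default timer values are illustrative, not recommendations.

```python
import socket


def enable_aggressive_keepalive(sock: socket.socket, idle: int = 60,
                                interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalives for one socket and shorten the timers.

    The per-socket timer options are applied only if this platform's
    socket module defines them; SO_KEEPALIVE itself is always set.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle),       # seconds idle before first probe
                        ("TCP_KEEPINTVL", interval),  # seconds between probes
                        ("TCP_KEEPCNT", count)):      # unanswered probes before giving up
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values the kernel would give up on a silent peer after roughly 60 + 10 × 5 = 110 seconds instead of the two-hour-plus Linux default.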
I thought your issue was the backend not terminating a query when the
client died while the backend was in the middle of a long-running query.
Keepalives alone won't solve that one.
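Keepalives only change what the kernel knows about the connection; a process busy executing a query learns about it only when it next touches the socket. The following hypothetical Python sketch (not how PostgreSQL is implemented) shows the general idea of a worker that peeks at the client socket between units of work and stops once the peer has closed. It only catches a FIN or RST; a host that vanished silently still looks alive until keepalives or a failed write say otherwise.

```python
import socket


def peer_has_closed(conn: socket.socket) -> bool:
    """Return True if the peer has closed its end of the connection.

    Peeks at one byte without consuming it and without blocking, then
    restores the socket's original timeout/blocking mode.
    """
    old_timeout = conn.gettimeout()
    conn.settimeout(0.0)                  # non-blocking for the peek
    try:
        return conn.recv(1, socket.MSG_PEEK) == b""  # b"" means orderly EOF (FIN)
    except BlockingIOError:
        return False                      # no data waiting, connection still open
    except ConnectionResetError:
        return True                       # peer sent RST
    finally:
        conn.settimeout(old_timeout)


def run_long_query(conn: socket.socket, steps: int) -> int:
    """Hypothetical long-running job that checks on the client between steps."""
    done = 0
    for _ in range(steps):
        if peer_has_closed(conn):
            break                         # client is gone: stop doing wasted work
        done += 1                         # stand-in for one chunk of real work
    return done
```

Polling between units of work, rather than waiting for a write to fail, matters for queries that send nothing back until they finish: the failed write may come only after all the work has already been done.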
--
Craig Ringer
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general