Re: Problems with PG 9.3

Scott Marlowe <scott.marlowe@xxxxxxxxx> · Tue, 26 Aug 2014 13:01:33 -0600

On Tue, Aug 26, 2014 at 12:22 PM, Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
> It's been 15 hours now since the DB was restarted, and things have started
> to get stuck. Apparently they are taking too long to finish with these
> settings. Any further suggestions?

Troubleshoot it while it's stuck. If your app isn't stopping or
erroring out when it loses its connection, then it's broken and someone
needs to code real error handling into it (or you're using a language
that's fundamentally broken in terms of handling network errors). That
is especially true because, with a lower TCP keepalive, the app should
be told that the connection died in under 10 minutes.
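
To illustrate the keepalive point, here is a minimal sketch, assuming the
app connects through libpq or a driver built on it (psycopg2, for
example), which accepts the keepalives_idle, keepalives_interval and
keepalives_count connection parameters. The host, database name and the
specific values are placeholders, not settings from this thread.

```python
def keepalive_dsn(host, dbname, idle=60, interval=10, count=6):
    """Build a libpq-style connection string with client-side TCP keepalives.

    All values are illustrative assumptions; tune them for your network.
    """
    params = {
        "host": host,
        "dbname": dbname,
        "connect_timeout": 10,            # give up on a connect attempt after 10 s
        "keepalives": 1,                  # enable TCP keepalives on the socket
        "keepalives_idle": idle,          # idle seconds before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before the drop
    }
    return " ".join(f"{key}={value}" for key, value in params.items())


def detection_seconds(idle, interval, count):
    """Rough upper bound on how long an idle dead connection goes unnoticed."""
    return idle + interval * count


if __name__ == "__main__":
    print(keepalive_dsn("db.example.com", "mydb"))
    print(detection_seconds(60, 10, 6), "seconds to notice a dead peer")
```

The resulting string can be handed to the driver's connect call. The app
still has to catch the resulting connection error and stop or reconnect,
which is the error handling part.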

So I'm going on the assumption that you're losing the connection. YOU
need to figure out why. Tools like netstat and strace are useful here.
If a backend is crashing, there'll be an entry in the PostgreSQL logs; if
networking is killing it, then maybe a firewall will have logs; if the
OOM killer is killing it, then the Linux logs on the db server will say
so. Use tools like sar (from the sysstat package), Zabbix, and other
monitoring packages to see if you're running out of RAM and the OOM
killer is killing processes.
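
A quick way to check the OOM theory is to scan the kernel log for kill
messages. This is a minimal sketch, assuming the usual Linux wording
("Out of memory: Kill process N (name)" or "Killed process N (name)");
the exact text differs between kernel versions, and find_oom_kills is a
hypothetical helper name.

```python
import re

# Matches "Kill process 1234 (postgres)" as well as
# "Killed process 1234 (postgres)"; wording varies by kernel version.
OOM_RE = re.compile(r"[Kk]ill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) for every OOM kill message found."""
    hits = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Feed it the lines of the kernel log on the db server (dmesg output, or a
file such as /var/log/messages or /var/log/syslog depending on the
distribution). Any hit naming postgres means the kernel killed a backend.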

I assume you've lowered work_mem and related settings to something more
reasonable, like 16MB, and that you restarted the server after dropping
max_connections down to 200. Note that 200 is still far too many, and
you need to look into a connection pooler to reduce that number to less
than 2 x CPU cores. Anything over that is counterproductive and likely
to cause performance issues.
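
The arithmetic behind those two numbers can be sketched as follows. This
is a rough illustration only: both function names are hypothetical, and
nodes_per_query is an assumed multiplier reflecting that a single query
can use work_mem once per sort or hash step.

```python
import os


def pool_size_limit(cpu_cores):
    """Largest connection count that stays below 2 x CPU cores."""
    return 2 * cpu_cores - 1


def worst_case_work_mem_mb(connections, work_mem_mb, nodes_per_query=1):
    """Memory (MB) used if every connection fills work_mem at the same time."""
    return connections * work_mem_mb * nodes_per_query


if __name__ == "__main__":
    # Only meaningful when run on the db server itself.
    cores = os.cpu_count() or 1
    print("pool size limit:", pool_size_limit(cores))
    print("200 conns x 16MB:", worst_case_work_mem_mb(200, 16), "MB")
```

Even at 16MB, 200 busy connections can ask for about 3.2GB of sort and
hash memory on top of shared_buffers, which is one reason a much smaller
pool is safer.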

Using netstat -an, can you find matching connections from the stalled
machine to the db server? If not, you've lost the network connection. If
there's no obvious cause in the pg or system logs on the db server, then
it's networking.
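
The netstat check can be scripted along these lines. This is a minimal
sketch, assuming the Linux netstat -an column layout (proto, recv-q,
send-q, local address, foreign address, state) and the default port
5432; connections_to is a hypothetical helper name.

```python
def connections_to(netstat_output, db_host, db_port=5432):
    """Return (local, foreign, state) for TCP rows whose peer is the db server."""
    target = f"{db_host}:{db_port}"
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: proto recv-q send-q local foreign state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[4] == target:
            matches.append((fields[3], fields[4], fields[5]))
    return matches
```

Run netstat -an on the stalled machine and pass its output in as a
string. Then run the same check on the db server with the roles
reversed, looking for the client's address in the foreign column. A
connection that shows ESTABLISHED on one side and is missing on the
other points at the network in between.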

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)