On Tue, Aug 26, 2014 at 12:22 PM, Dhruv Shukla <dhruvshukla82@xxxxxxxxx> wrote:
> It's 15 hours now ... that the DB was restarted and things have started to
> get stuck. Apparently taking too long to finish with these settings.... any
> further suggestions??

Troubleshoot it while it's stuck. If your app isn't stopping or erroring out when it loses its connection, then it's broken and someone needs to code real error handling into it (or you're using a language that's fundamentally broken in terms of handling network errors). Especially because, with a lower TCP keepalive, the app should be told that the connection died in under 10 minutes.

So I'm going on the assumption that you're losing the connection. YOU need to figure out why. Tools like netstat and strace are useful here. If a backend is crashing, there'll be an entry in the pg logs; if networking is killing it, then maybe a firewall will have logs; if the OOM killer is killing it, then the Linux logs on the db server will say so. Use tools like sar (part of the sysstat package), Zabbix, and other monitoring packages to see if you're running out of RAM and the OOM killer is killing processes.

I assume you've lowered your work_mem etc. down to something more reasonable, like 16MB, and that you restarted the server after dropping max_connections down to 200. Note that 200 is still far too many, and you need to look into a pooler to reduce that number to less than 2 x CPU cores. Anything over that is counterproductive and likely to cause performance issues.

Using netstat -an, can you find matching connections from the stalled machine to the db server? If not, you've lost the network connection. If there's no obvious cause in the pg or system logs on the db server, then it's networking.
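
As a rough illustration of the netstat -an check, here is a minimal Python sketch that scans netstat output for TCP connections whose foreign address is the db server. It is only a sketch under stated assumptions: the function names find_db_connections and run_netstat and the address 10.0.0.5 are hypothetical, the column layout is the one Linux net-tools prints (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and 5432 is simply the PostgreSQL default port, so adjust it if your server listens elsewhere.

```python
import subprocess


def find_db_connections(netstat_output, db_host, db_port=5432):
    """Return (local, foreign, state) tuples for TCP rows pointing at the db server.

    Assumes Linux-style `netstat -an` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    target = f"{db_host}:{db_port}"
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP/unix-socket rows, and anything too short to be a TCP row.
        if len(fields) < 6 or not fields[0].lower().startswith("tcp"):
            continue
        if fields[4] == target:
            matches.append((fields[3], fields[4], fields[5]))
    return matches


def run_netstat():
    """Capture `netstat -an` output on the stalled machine (requires netstat installed)."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the stalled machine you might call find_db_connections(run_netstat(), "10.0.0.5"), with the hypothetical address replaced by your db server's IP. An empty list while the app still believes it is connected would point toward a lost network connection, whereas rows in the ESTABLISHED state suggest looking at the backend and the server logs instead.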