Hi all,
I have a production database that sometimes runs out of memory at nightly vacuum.
The application runs typically with around 40 postgres connections but there are times when the connections increase because of some queries going on.
You should consider not allowing that to happen. If 40 connections is the sweet spot for your hardware, then you can't change reality simply by changing the connection pool setting.
The reason is that the operations are slow, the terminals time out and try to reconnect using new connections.
So, "The beatings will continue until morale improves". You should consider not allowing that to happen, either. If the original operation is slow, why would trying it again on a new connection (while leaving the original one behind to clog things up) be any better?
Some time ago I started to have problems with too many connections being open so I lowered the limit to 300 connections. It was all good until recently when even with 180 connections I am running out of memory on vacuum... So the connections increase to 180 and the system still runs properly for other 2 days but then at nightly vacuum runs out of memory.
The fix is to restart postgres ... If I only close the connections the problem is still these so I need to restart postgres.
How are you closing the connections?
If I don't restart postgres then the system will run out of memory on queries at a point...
Another important thing is that during vacuum at 1am nothing else is going on that server so all the connections are idle.
Truly idle, or idle in transaction, or not even that? If the "abandoned" connections have completed whatever long running thing caused them to be abandoned, then they should no longer exist. What are the largest processes according to top?
You seem to have a series of unfortunate events here, but I think you are tackling them from the wrong end. What are these slow queries that take up a lot of memory, and why can't they be improved? Given that you have slow queries that take a lot of memory, why does your app respond to this by launching a Denial of Service attack against its own database, and why do you let it do that?
Cheers,
Jeff