Re: Properly handle OOM death?

Justin Pryzby <pryzby@xxxxxxxxxxxxx> · Mon, 13 Nov 2023 08:42:31 -0600

On Mon, Mar 13, 2023 at 06:43:01PM +0100, Peter J. Holzer wrote:
> On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
> > I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
> > memory constrained than I would like, such that every week or so the various
> > processes running on the machine will align badly and the OOM killer will kick
> > in, killing off postgresql, as per the following journalctl output:
> > 
> > Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
> > Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with result 'oom-kill'.
> > Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d 17h 48min 24.509s CPU time.
> > 
> > And the service is no longer running.
> 
> I might be misreading this, but it looks to me that systemd detects that
> *some* process in the group was killed by the oom killer and stops the
> service.

Yeah.

I found this old message via Google.  I'm surprised there aren't more
complaints like it.  It's as Peter said: after an OOM kill, systemd
(sometimes) actively *stops* the whole cluster, when the postmaster
would've restarted the backends and come back online on its own if the
init (supervisor) process hadn't interfered.

My solution was to set OOMPolicy=continue in the [Service] section of
/usr/lib/systemd/system/postgresql@.service

I suggest that the default unit files should do likewise.
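
For anyone who would rather not edit the packaged unit file (a package
upgrade can overwrite it), a drop-in override should have the same
effect.  This is only a sketch, and it assumes the Debian/Ubuntu
template unit name postgresql@.service and a systemd new enough to
know OOMPolicy= (243 or later, as far as I know).  The drop-in path
below is the one "systemctl edit postgresql@.service" normally creates;
adjust it if your packaging differs:

```ini
# /etc/systemd/system/postgresql@.service.d/override.conf
# Assumed drop-in location; the unit name is the Debian/Ubuntu template.
[Service]
# Do not stop the whole unit when the kernel OOM killer kills one of
# its processes; leave recovery to the postmaster.
OOMPolicy=continue
```

If the file is created by hand rather than through "systemctl edit",
run "systemctl daemon-reload" afterwards.  "systemctl show -p OOMPolicy
postgresql@13-main.service" should then report the new value; the
instance may need a restart before the setting takes effect.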

-- 
Justin