Re: Properly handle OOM death?

Israel Brewster <ijbrewster@xxxxxxxxxx> · Mon, 13 Mar 2023 09:55:50 -0800

> On Mar 13, 2023, at 9:43 AM, Peter J. Holzer <hjp-pgsql@xxxxxx> wrote:
> 
> On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
>> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
>> memory constrained than I would like, such that every week or so the various
>> processes running on the machine will align badly and the OOM killer will kick
>> in, killing off postgresql, as per the following journalctl output:
>> 
>> Mar 12 04:04:23 novarupta systemd[1]: postgresql@13-main.service: A process of
>> this unit has been killed by the OOM killer.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Failed with
>> result 'oom-kill'.
>> Mar 12 04:04:32 novarupta systemd[1]: postgresql@13-main.service: Consumed 5d
>> 17h 48min 24.509s CPU time.
>> 
>> And the service is no longer running.
> 
> I might be misreading this, but it looks to me that systemd detects that
> *some* process in the group was killed by the oom killer and stops the
> service.
> 
> Can you check which process was actually killed? If it's not the
> postmaster, setting OOMScoreAdjust is probably useless.
> 
> (I tried searching the web for the error messages and didn't find
> anything useful)

Your guess is as good as (if not better than) mine. I can find the PID of the killed process in the system log, but without knowing what the PIDs of the postmaster and its child processes were prior to the kill, I’m not sure that helps much. For what it’s worth, though, I do note the following about all the kill logs:

1) They reference a “Memory cgroup out of memory”, which ties back to the opening comment in Joe Conway’s message. This would imply to me that I *AM* running with a cgroup memory.limit set. I’m not sure how that changes things.
2) All the entries contain the line “oom_score_adj:0”, which would seem to imply that the postmaster, with its -900 score, is not being directly targeted by the OOM killer.
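
To answer the "which process was actually killed" question more directly next time, something like the following rough sketch might help. It pulls the PID, process name, and oom_score_adj out of a kernel kill line and compares the PID against the first line of postmaster.pid, which holds the postmaster's PID. The assumptions: the log line layout is what I believe 5.x kernels print and may differ elsewhere, the sample line and its numbers are made up for illustration, the function names are my own, and the pid file contents would have to be saved while the server is still running, since the file is likely gone once the service has stopped.

```python
import re

# Matches the kernel's "Killed process" line. The field layout is assumed
# from 5.x kernels and may not match other kernel versions.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
    r".*?oom_score_adj:(?P<adj>-?\d+)"
)


def parse_oom_kill(line):
    """Return (pid, comm, oom_score_adj) from a kill line, or None."""
    m = KILL_RE.search(line)
    if m is None:
        return None
    return int(m.group("pid")), m.group("comm"), int(m.group("adj"))


def read_postmaster_pid(pidfile_text):
    """Return the postmaster PID from the text of a saved postmaster.pid."""
    return int(pidfile_text.splitlines()[0].strip())


def was_postmaster_killed(line, pidfile_text):
    """True if the killed PID is the postmaster's, None if line is not a kill line."""
    parsed = parse_oom_kill(line)
    if parsed is None:
        return None
    return parsed[0] == read_postmaster_pid(pidfile_text)


# Hypothetical sample line; the PID, sizes, and UID are invented.
SAMPLE = (
    "Memory cgroup out of memory: Killed process 12345 (postgres) "
    "total-vm:1048576kB, anon-rss:524288kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:113 pgtables:2048kB oom_score_adj:0"
)

if __name__ == "__main__":
    print(parse_oom_kill(SAMPLE))
    # Pretend the saved postmaster.pid listed PID 999 on its first line.
    print(was_postmaster_killed(SAMPLE, "999\n/var/lib/postgresql/13/main\n"))
```

An oom_score_adj of 0 together with a PID that differs from the saved postmaster PID would be consistent with a backend (or some other process in the unit's cgroup) being the victim rather than the postmaster itself.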

> 
>> 2) My first thought was to simply have systemd restart postgresql whenever it
>> is killed like this, which is easy enough. Then I looked at the default unit
>> file, and found these lines:
>> 
>> # prevent OOM killer from choosing the postmaster (individual backends will
>> # reset the score to 0)
>> OOMScoreAdjust=-900
>> # restarting automatically will prevent "pg_ctlcluster ... stop" from working,
>> # so we disable it here.
> 
> I never call pg_ctlcluster directly, so that probably wouldn't be a good
> reason for me.

Valid point, unless something under the hood needs to call it?

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145

> 
>> Also, the postmaster will restart by itself on most
>> # problems anyway, so it is questionable if one wants to enable external
>> # automatic restarts.
>> #Restart=on-failure
> 
> So I'd try this despite the comment.
> 
>        hp
> 
> -- 
>   _  | Peter J. Holzer    | Story must make more sense than reality.
> |_|_) |                    |
> | |   | hjp@xxxxxx         |    -- Charles Stross, "Creative writing
> __/   | http://www.hjp.at/ |       challenge!"