Re: Fedora 32 System-Wide Change proposal (late): Enable EarlyOOM

On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering <mzerqung@xxxxxxxxxxx> wrote:
>
> Looking at the sources very superficially I see a couple of problems:
>
> 1. Waking up all the time in 100ms intervals? We generally try to
>    avoid waking the CPU up all the time if nothing happens. Saving
>    power and things.

I agree. What do you think is a reasonable interval? Given that
earlyoom won't send SIGTERM until both free memory and free swap drop
below 10%, and getting there takes at least a few seconds, what about
an interval of 3 seconds?
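
For illustration only (this is my sketch of the idea, not earlyoom's
actual code), the polling could also be adaptive: sleep a long time
while there's plenty of headroom, and only drop to short intervals as
free memory approaches the SIGTERM threshold, e.g.:

    /* Hypothetical adaptive poll interval: wake rarely when memory is
     * plentiful, frequently only near the SIGTERM threshold. */
    static unsigned int poll_interval_ms(double free_pct, double term_pct)
    {
        double headroom = free_pct - term_pct;  /* % above the threshold */
        if (headroom <= 0.0)
            return 100;                         /* at/below threshold: 100 ms */
        if (headroom >= 20.0)
            return 3000;                        /* lots of room: 3 s */
        return (unsigned int)(100.0 + headroom * 145.0);  /* linear ramp */
    }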


> But more importantly: are we sure this actually operates the way we
> should? i.e. PSI is really what should be watched. It is not
> interesting who uses how much memory and triggering kills on
> that. What matters is to detect when the system becomes slow due to
> that, i.e. *latencies* introduced due to memory pressure and that's
> what PSI is about, and hence what should be used.

Earlyoom is a short-term stopgap while a more sophisticated solution
matures, namely low-memory-monitor, which does leverage PSI.
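
For anyone who hasn't looked at PSI yet: the kernel reports memory
stall time in /proc/pressure/memory, which is what low-memory-monitor
watches. The avg fields are the percentage of the last 10/60/300
seconds that tasks spent stalled waiting on memory, and total is the
cumulative stall time in microseconds (the numbers below are made up):

    some avg10=0.35 avg60=0.12 avg300=0.02 total=1423056
    full avg10=0.18 avg60=0.05 avg300=0.01 total=702134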

> But even if we'd ignore that in order fight latencies one should watch
> latencies: OOM killing per process is just not appropriate on a
> systemd system: all our system services (and a good chunk of our user
> services too) are sorted neatly into cgroups, and we really should
> kill them as a whole and not just individual processes inside
> them. systemd manages that today, and makes exceptions configurable
> via OOMPolicy=, and with your earlyoom stuff you break that.

OOMPolicy= depends on the kernel oom-killer, which is extremely
reluctant to trigger at all. In my testing, the vast majority of the
time the kernel oom-killer takes more than 30 minutes to trigger. And
it may not even kill the worst offender, but rather something like
sshd. A couple of times I've seen it kill systemd-journald. That's
not a small problem.
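
And to be clear about what OOMPolicy= covers: it only controls how
systemd reacts to a unit once the kernel has already killed something
inside it, e.g. with a drop-in like this (hypothetical unit name). It
doesn't change when, or whether, the kernel acts:

    # /etc/systemd/system/example.service.d/oom.conf
    [Service]
    # kill the whole unit if the kernel oom-killer kills any process in it
    OOMPolicy=kill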

earlyoom first sends SIGTERM. That's no different from the user
saying: enough of this, let's gracefully quit the offending process.
Only if the problem continues to get worse is SIGKILL sent.
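
Schematically (my sketch, not earlyoom's source), the escalation is
nothing more than:

    /* Sketch: SIGTERM first, SIGKILL only if both memory and swap keep
     * falling; assumes 10%/5% thresholds and pid = worst offender. */
    #include <signal.h>
    #include <sys/types.h>

    static void act(pid_t pid, double mem_free_pct, double swap_free_pct)
    {
        if (mem_free_pct <= 5.0 && swap_free_pct <= 5.0)
            kill(pid, SIGKILL);   /* last resort, no chance to clean up */
        else if (mem_free_pct <= 10.0 && swap_free_pct <= 10.0)
            kill(pid, SIGTERM);   /* polite request, same as a user would do */
    }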


> This looks like second guessing the kernel memory management folks at
> a place where one can only lose, and at the time breaking correct OOM
> reporting by the kernel via cgroups and stuff.

It is intended as a substitute for the user hitting the power button.
It's not intended as a substitute for the OS, as a whole, becoming a
better user advocate and doing the right thing in the first place,
which it currently doesn't.

For now, kernel developers have made it clear they do not care about
user space responsiveness. At all. Their concern with the kernel
oom-killer is strictly keeping the kernel functioning. The congestion
that results from heavy simultaneous page-in and page-out also doesn't
appear to concern kernel developers: it's a well-known problem, and
they haven't made any breakthrough in this area.

So it's really going to need to be user space managed, leveraging PSI
and cgroupv2. And that's the next step.
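
PSI even has an event interface built for exactly this: instead of
waking on a timer at all, a user space manager can register a stall
threshold with the kernel and sleep in poll() until memory pressure
actually exceeds it. Roughly (a sketch, error handling omitted, and
the thresholds are arbitrary):

    /* Ask the kernel to wake us when tasks are fully stalled on memory
     * for more than 150 ms within any 1 s window. */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        const char *trig = "full 150000 1000000";  /* 150 ms stall / 1 s window */
        write(fd, trig, strlen(trig) + 1);

        struct pollfd pfd = { .fd = fd, .events = POLLPRI };
        while (poll(&pfd, 1, -1) > 0)
            printf("memory pressure event\n");     /* a real manager would act here */
        return 0;
    }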


> Also: what precisely is this even supposed to do? Replace the
> algorithm for detecting *when* to go on a kill rampage? Or actually
> replace the algorithm selecting *what* to kill during a kill rampage?

a. It's never a kill rampage.
b. When: it first sends SIGTERM once both free memory and free swap
drop below 10%, and SIGKILL once both drop below 5%.
In hundreds of tests I've never seen earlyoom resort to SIGKILL; so
far everything responds fairly promptly to SIGTERM. But I'm also
testing with well-behaved programs, nothing malicious, and that's
intentional. The problem would be far worse with something malicious.
c. What: same as the kernel oom-killer, it uses oom_score (roughly the
sketch below).
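
That is, the selection is approximately this (my simplification of the
idea, not earlyoom's actual source): scan /proc and take the PID the
kernel itself already scores as the worst offender:

    /* Sketch: pick the process with the highest /proc/<pid>/oom_score. */
    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    static pid_t worst_offender(void)
    {
        DIR *d = opendir("/proc");
        struct dirent *e;
        pid_t worst = -1;
        long worst_score = -1;

        while (d && (e = readdir(d)) != NULL) {
            if (!isdigit((unsigned char)e->d_name[0]))
                continue;                    /* only numeric entries are PIDs */
            char path[64];
            snprintf(path, sizeof(path), "/proc/%s/oom_score", e->d_name);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            long score;
            if (fscanf(f, "%ld", &score) == 1 && score > worst_score) {
                worst_score = score;
                worst = (pid_t)atol(e->d_name);
            }
            fclose(f);
        }
        if (d)
            closedir(d);
        return worst;
    }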

It isn't replacing anything. It's acting as a user advocate by
approximating what a reasonable user would do: send SIGTERM. The user
can't do this themselves because during heavy swap, system
responsiveness is already lost well before we're anywhere near OOM.

You're right, someone should absolutely solve the responsiveness
problem. Kernel folks have clearly ceded it. Can it be done with
cgroupv2 and PSI alone? Unclear.


> If it's the former (which the name of the project suggests,
> _early_oom)), then at the most basic the tool should let the kernel do
> the killing, i.e. "echo f > /proc/sysrq-trigger". That way the
> reporting via cgroups isn't fucked, and systemd can still do its
> thing, and the kernel can kill per cgroup rather than per process...

That would be a kill rampage. sysrq+f issues SIGKILL and always
results in data loss. earlyoom's first step is SIGTERM, which is a
much more conservative approach.

> Anyway, this all sounds very very fishy to me. Not thought to the end,
> and I am pretty sure this is something the kernel memory management
> folks should give a blessing to. Second guessing the kernel like that
> is just a bad idea if you ask me.

There's no first or second guessing. The kernel oom-killer is strictly
responsible for maintaining enough resources for the kernel, not for
system responsiveness. The idea of user space oom management is to
take user space priorities into account, a question kernel folks have
rather intentionally stayed out of.

> I mean, yes, the OOM killer might not be that great currently, but
> this sounds like something to fix in kernel land, and if that doesn't
> work out for some reason because kernel devs can't agree, then do it
> as fallback in userspace, but with sound input from the kernel folks,
> and the blessing of at least some of the kernel folks.

The kernel oom-killer works exactly as intended and designed. It does
not give either one or two shits about user space. It cares only about
proper kernel function, and to that end it's working 100% effectively
as near as I can tell.

The mistake most people make is the idea that the kernel oom-killer is
intended as a user space, let alone end user, advocate. That is what
earlyoom and other user space oom managers are trying to be.

I do rather like your idea from some months ago about moving to
systems that have a much smaller swap during normal use, only creating
and activating a large swap at hibernation time, and deactivating it
again after resuming. That way only "incidental" swap is allowed
during normal operation, and heavy swapping very quickly has nowhere
to go. And tied to that, a way to restrict the resources unprivileged
processes get rather than letting them overcommit, which is something
low-memory-monitor attempts to achieve.
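
Roughly, and glossing over details like the resume=/resume_offset=
kernel parameters and filesystem-specific swapfile constraints, the
flow I imagine is (paths and sizes are just examples):

    # normal operation: only a small swap (or zram) is active

    # just before hibernating:
    dd if=/dev/zero of=/hibernate.swap bs=1MiB count=16384   # sized >= RAM
    chmod 600 /hibernate.swap
    mkswap /hibernate.swap
    swapon /hibernate.swap
    systemctl hibernate

    # after resume:
    swapoff /hibernate.swap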


-- 
Chris Murphy
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx



