Re: Fedora 32 System-Wide Change proposal (late): Enable EarlyOOM

Lennart Poettering <mzerqung@xxxxxxxxxxx> · Mon, 6 Jan 2020 17:47:38 +0100

On Mon, 06.01.20 08:51, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering <mzerqung@xxxxxxxxxxx> wrote:
> >>
> > Looking at the sources very superficially I see a couple of problems:
> >
> > 1. Waking up all the time in 100ms intervals? We generally try to
> >    avoid waking the CPU up all the time if nothing happens. Saving
> >    power and things.
>
> I agree. What do you think is a reasonable interval? Given that
> earlyoom won't SIGTERM until both 10% memory free and 10% swap free,
> and that will take at least some seconds, what about an interval of 3
> seconds?

None. Use PSI. It wakes you up only when pressure stalls reach the
threshold you declare. Which basically means you never steal the CPUs
on an idle system, you never cause a wakeup whatsoever.
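
For illustration, a minimal Python sketch of such a PSI trigger,
assuming a kernel with PSI trigger support and permission to write to
/proc/pressure/memory. The threshold (150 ms of stall within a 1 s
window) and the helper names format_trigger and wait_for_pressure are
assumptions made up for this example, not something earlyoom or l-m-m
ships:

```python
import os
import select

PSI_MEMORY = "/proc/pressure/memory"


def format_trigger(kind, stall_us, window_us):
    """Build a PSI trigger string: '<some|full> <stall us> <window us>'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall must be positive and no larger than the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(trigger, path=PSI_MEMORY, timeout_ms=None):
    """Sleep until the kernel reports the trigger fired.

    Returns True when the threshold was crossed, False on timeout.
    The process is not scheduled at all while nothing happens.
    """
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The C example in the kernel documentation writes the terminating
        # NUL byte as well; this sketch mirrors that.
        os.write(fd, trigger.encode("ascii") + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, revents in poller.poll(timeout_ms):
            if revents & select.POLLERR:
                raise OSError("PSI event source is gone")
            if revents & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

A caller would block in wait_for_pressure(format_trigger("some", 150000,
1000000)) and only act once it returns True. A long-running daemon would
keep the file descriptor open and poll it repeatedly rather than
re-registering the trigger each time.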

> > But more importantly: are we sure this actually operates the way we
> > should? i.e. PSI is really what should be watched. It is not
> > interesting who uses how much memory and triggering kills on
> > that. What matters is to detect when the system becomes slow due to
> > that, i.e. *latencies* introduced due to memory pressure and that's
> > what PSI is about, and hence what should be used.
>
> Earlyoom is a short term stop gap while a more sophisticated solution
> is still maturing. That being low-memory-monitor, which does leverage
> PSI.

Yes, l-m-m is great. If we can deploy l-m-m today already, why isn't
it good enough for earlyoom?

> > But even if we'd ignore that in order to fight latencies one should watch
> > latencies: OOM killing per process is just not appropriate on a
> > systemd system: all our system services (and a good chunk of our user
> > services too) are sorted neatly into cgroups, and we really should
> > kill them as a whole and not just individual processes inside
> > them. systemd manages that today, and makes exceptions configurable
> > via OOMPolicy=, and with your earlyoom stuff you break that.
>
> OOMPolicy= depends on the kernel oom-killer, which is extremely
> reluctant to trigger at all. Consistently in my testing, the vast
> majority of the time, kernel oom-killer takes > 30 minutes to trigger.
> And it may not even kill the worst offender, but rather something like
> sshd. A couple of times, I've seen it kill systemd-journald. That's
> not a small problem.

Well, that sounds as if OOMScoreAdjust= of these services should be
tweaked. In journald we use OOMScoreAdjust=-250 and in udevd
OOMScoreAdjust=-1000.

If journald is still too likely to be killed, we can certainly bump it
to -900 or so, please file a bug.
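
To check which adjustment a running service actually ended up with, one
can read the value the kernel exposes per process. A small Python sketch,
assuming a Linux /proc layout; the helper name read_oom_score_adj and its
proc_root parameter are made up for this example:

```python
from pathlib import Path


def read_oom_score_adj(pid="self", proc_root="/proc"):
    """Return the OOM score adjustment recorded for a process.

    The kernel accepts values from -1000 (never pick this process) to
    1000 (pick it first); anything else means we read the wrong file.
    """
    raw = Path(proc_root, str(pid), "oom_score_adj").read_text()
    value = int(raw.strip())
    if not -1000 <= value <= 1000:
        raise ValueError(f"unexpected oom_score_adj value: {value}")
    return value
```

The PID to pass in can be obtained with
"systemctl show -p MainPID --value systemd-journald.service".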

> earlyoom first sends SIGTERM. It's not different from the user saying,
> enough of this, let's just gracefully quit the offending process. Only
> if the problem continues to get worse is SIGKILL sent.
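
The escalation described in the quoted paragraph can be sketched in a few
lines of Python. Note the assumption: a fixed grace period stands in for
the "problem continues to get worse" condition, which is not how earlyoom
decides; the function name terminate_then_kill is likewise made up for
this example:

```python
import subprocess


def terminate_then_kill(proc, grace_seconds=5.0):
    """Ask a child process to quit, and force it only if it does not.

    proc is a subprocess.Popen object. Returns the exit status, which is
    the negative signal number if the child died from a signal.
    """
    proc.terminate()  # sends SIGTERM, the polite request
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        return proc.wait()
```

This acts on a single process the caller spawned itself. It does nothing
about the rest of the service's cgroup.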

This sounds as if you want low-memory-monitor, but for all services,
right?

Sounds like something that is relatively easily implementable in
systemd though, in a much better way, i.e. hooked to PSI...

> For now, kernel developers have made it clear they do not care about
> user space responsiveness. At all. Their concern with kernel

References to this? I mean, the kernel developers are not a single
person, they tend to have different opinions...

> > Also: what precisely is this even supposed to do? Replace the
> > algorithm for detecting *when* to go on a kill rampage? Or actually
> > replace the algorithm selecting *what* to kill during a kill rampage?
>
> a. It's never a kill rampage.

It calls kill(), which I call a "kill rampage"...

> It isn't replacing anything. It's acting as a user advocate by
> approximating what a reasonable user would do, SIGTERM. The user can't
> do this themselves because during heavy swap system responsivity is
> already lost, before we're even close to OOM.
>
> You're right, someone should absolutely solve the responsivity
> problem. Kernel folks have clearly ceded this. Can it be done with
> cgroupv2 and PSI alone? Unclear.

Sounds like someone needs to do their homework, if this is "unclear"?

I mean, you basically admit here that this isn't really figured out to
the end. Maybe let's give this a bit more time and figure things out a
bit more, instead of rushing earlyoom in?

Adopting something now, at a point where we already clearly know that
PSI is how this should be done, sounds very wrong to me.

> That would be a killing rampage. sysrq+f issues SIGKILL and definitely
> results in data loss, always. Earlyoom uses SIGTERM as a first step,
> which is a much more conservative first attempt.

But it sends SIGKILL next? Why? Why not sysrq+f triggered from userspace
for that?

I must say the idea that there are effectively multiple process
babysitters now, each wanting to decide when to terminate services,
sounds very wrong to me...

I mean, wouldn't this all be solved much more nicely, and in a much more
future-proof way, if someone would just do what l-m-m does as part of
systemd service management, i.e. let's say an option
StopOnMemoryPressure= that watches PSI and terminates services
*cleanly* when needed, i.e. goes through ExecStop= and such?
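
To make the idea concrete, here is a rough Python sketch of what such a
watcher could look like from the outside. StopOnMemoryPressure= is only
the option name floated above and does not exist; the helper names, the
choice of the "full avg10" metric and the 10% limit are all assumptions
made for this example. It reads the pressure file once instead of using
a trigger, to keep the sketch short:

```python
import subprocess


def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file.

    Returns e.g. {"some": {"avg10": 1.5, ...}, "full": {"avg10": 0.0, ...}}.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def under_pressure(psi, kind="full", metric="avg10", limit=10.0):
    """True if the chosen stall percentage has reached the limit."""
    return psi.get(kind, {}).get(metric, 0.0) >= limit


def stop_cleanly(unit):
    """Stop a unit through the service manager, so ExecStop= is honoured."""
    subprocess.run(["systemctl", "stop", unit], check=True)


def check_once(unit, path="/proc/pressure/memory", limit=10.0):
    """Stop `unit` if memory pressure is at or above `limit` percent."""
    with open(path) as f:
        psi = parse_psi(f.read())
    if under_pressure(psi, limit=limit):
        stop_cleanly(unit)
        return True
    return False
```

The point of going through "systemctl stop" rather than kill() is that
the whole unit is shut down by the service manager, not a single process
picked out of it.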

And you know what, PSI is precisely defined to be used for purposes
like this, we already have experience with it (see l-m-m) and a patch
adding this to systemd isn't really that hard either...

> I do rather like your idea from some months ago, about moving to
> systems that have a much smaller swap during normal use, and only
> creating+activating a large swap at hibernation time. And after
> resuming from hibernation, deactivating the large swap. That way
> during normal operation, only "incidental" swap is allowed, and heavy
> swap very quickly has nowhere to go. And tied to that, a way to
> restrict the resources unprivileged processes get, rather than being
> allowed to overcommit - something low-memory-monitor attempts to
> achieve.

Memory paging doesn't just mean swapping, i.e. writing stuff to and
reading stuff from a swap partition of some kind. Paging also means
that program code is memory mapped from binary files and can be loaded
into memory and unloaded any time since it can be re-read whenever it
is needed. Thus, you should be able to get under memory pressure even
without swap simply because the program code of the programs you run
needs to be paged in/out all the time from your rootfs, rather than
from a swap partition...

Anyway, still not convinced having this is a good idea. There's a lot
of homework to be done first...

Lennart

--
Lennart Poettering, Berlin