Re: Fedora 32 System-Wide Change proposal (late): Enable EarlyOOM

Lennart Poettering <mzerqung@xxxxxxxxxxx> · Mon, 6 Jan 2020 18:09:17 +0100

On Mo, 06.01.20 10:09, Michael Catanzaro (mcatanzaro@xxxxxxxxx) wrote:

> Is there a way to check memory usage without periodic wakeups?

PSI. It measures latency though. Which is the right thing to measure
here... You can configure thresholds there and it wakes you up when
those are hit. Thus userspace doesn't have to poll at all...

> In WebKit we wake up every 5s to check memory usage if we saw low memory
> usage on the last wakeup, or every 1s if it was high, with a scale in
> between. Would be good to experiment with the timings and see how long we
> can get away with before the polling interval is too large to prevent system
> lockups. (The WebKit timings are designed for cache clearing, not for
> maintaining system responsiveness, so I wouldn't trust those.)

Watch things with PSI.

> > 2. New code using system() in the year 2020? Really?
> >
> > 3. Fixed size buffers and implicit, undetected, truncation of strings
> >    at various places (for example, when formatting the shell string to
> >    pass to system()).
>
> Thanks. The code review is much appreciated. If we're going to be running a
> superuser deamon, then we need to be confident that it doesn't do these
> dangerous things. And these choices do raise quality concerns about what
> might be lurking in the rest of the code, as well.

BTW, this should not be a root daemon anyway. It only needs one cap:
CAP_SYS_KILL. Hence, drop privs to some user of its own, and keep that
one cap. Use AmbientCapabilities= in the unit file.

> My understanding is that experiments with PSI have indicated that it's hard
> to make it work well in practice. Alexey (hakavlad) has investigated this
> topic extensively, and his conclusion was:
>
> "PSI-based process killing should not be used by default, because this topic
> is still poorly understood and we don’t know what thresholds are desirable
> for most users: it’s hard to find good default values."

If things are poorly understood, then understand them better... Don't
just adopt some stuff that isn't much better understood either...

> > But even if we'd ignore that in order fight latencies one should watch
> > latencies: OOM killing per process is just not appropriate on a
> > systemd system: all our system services (and a good chunk of our user
> > services too) are sorted neatly into cgroups, and we really should
> > kill them as a whole and not just individual processes inside
> > them. systemd manages that today, and makes exceptions configurable
> > via OOMPolicy=, and with your earlyoom stuff you break that.
> >
> > This looks like second guessing the kernel memory management folks at
> > a place where one can only lose, and at the time breaking correct OOM
> > reporting by the kernel via cgroups and stuff.
>
> I think it's very clear at this point that this is extremely unlikely to be
> fixed at the kernel level. If that changes, great, but in the meantime we
> need a userspace solution to prevent Fedora from locking up. The Workstation
> WG doesn't have much (any?) kernel development experience, and we're aware
> that historical discussions on fixing this issue at the kernel level have
> concluded negatively, so we're limiting our interest to userspace
> solutions.

Well, it's not that the kernel folks wouldn't provide you with some
tools to improve the situation, see PSI...

> > Also: what precisely is this even supposed to do? Replace the
> > algorithm for detecting *when* to go on a kill rampage? Or actually
> > replace the algorithm selecting *what* to kill during a kill rampage?
>
> earlyoom is restricted to the former, although in the future we might be
> interested in doing the later as well, either by enhancing earlyoom or
> switching to another tool, e.g. to prevent sshd or journald from being
> killed.

These services should set the OOMScoreAdjust= value to something
sensible. journald and udevd do that. maybe ssh should do too... (it's
a bit harder for ssh, since it needs to undo the setting for its
sessions again, since oom scores are propagated down the process tree)

> > If it's the former (which the name of the project suggests,
> > _early_oom)), then at the most basic the tool should let the kernel do
> > the killing, i.e. "echo f > /proc/sysrq-trigger". That way the
> > reporting via cgroups isn't fucked, and systemd can still do its
> > thing, and the kernel can kill per cgroup rather than per process...
>
> Problem is that letting the kernel do the work can cause data loss. earlyoom
> needs to handle process termination itself so that it can send SIGTERM
> first, instead of jumping straight to SIGKILL and corrupting who knows what.

Well, then tell systemd to do it for you... Use the D-Bus call
GetUnitByPID() and then issue StopUnit().

Lennart

--
Lennart Poettering, Berlin
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx