Re: Fedora 32 System-Wide Change proposal (late): Enable EarlyOOM

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Mon, Jan 6, 2020 at 11:07 am, Lennart Poettering <mzerqung@xxxxxxxxxxx> wrote:
Hmm, are we sure this is something we want to have in the default
install? Is the code really good enough for that?

Looking at the sources very superficially I see a couple of problems:

1. Waking up all the time in 100ms intervals? We generally try to
   avoid waking the CPU up all the time if nothing happens. Saving
   power and things.

Is there a way to check memory usage without periodic wakeups?

In WebKit we wake up every 5s to check memory usage if we saw low memory usage on the last wakeup, or every 1s if it was high, with a scale in between. Would be good to experiment with the timings and see how long we can get away with before the polling interval is too large to prevent system lockups. (The WebKit timings are designed for cache clearing, not for maintaining system responsiveness, so I wouldn't trust those.)

2. New code using system() in the year 2020? Really?

3. Fixed size buffers and implicit, undetected, truncation of strings
   at various places (for example, when formatting the shell string to
   pass to system()).

Thanks. The code review is much appreciated. If we're going to be running a superuser deamon, then we need to be confident that it doesn't do these dangerous things. And these choices do raise quality concerns about what might be lurking in the rest of the code, as well.

But more importantly: are we sure this actually operates the way we
should? i.e. PSI is really what should be watched. It is not
interesting who uses how much memory and triggering kills on
that. What matters is to detect when the system becomes slow due to
that, i.e. *latencies* introduced due to memory pressure and that's
what PSI is about, and hence what should be used.

My understanding is that experiments with PSI have indicated that it's hard to make it work well in practice. Alexey (hakavlad) has investigated this topic extensively, and his conclusion was:

"PSI-based process killing should not be used by default, because this topic is still poorly understood and we don’t know what thresholds are desirable for most users: it’s hard to find good default values."

https://pagure.io/fedora-workstation/issue/98#comment-615425

Details at: https://github.com/rfjakob/earlyoom/issues/100

So: already considered, but rejected for now.

But even if we'd ignore that in order fight latencies one should watch
latencies: OOM killing per process is just not appropriate on a
systemd system: all our system services (and a good chunk of our user
services too) are sorted neatly into cgroups, and we really should
kill them as a whole and not just individual processes inside
them. systemd manages that today, and makes exceptions configurable
via OOMPolicy=, and with your earlyoom stuff you break that.

This looks like second guessing the kernel memory management folks at
a place where one can only lose, and at the time breaking correct OOM
reporting by the kernel via cgroups and stuff.

I think it's very clear at this point that this is extremely unlikely to be fixed at the kernel level. If that changes, great, but in the meantime we need a userspace solution to prevent Fedora from locking up. The Workstation WG doesn't have much (any?) kernel development experience, and we're aware that historical discussions on fixing this issue at the kernel level have concluded negatively, so we're limiting our interest to userspace solutions.

I think everybody would be happy to hold off on userspace solutions if a kernel solution is in the works. I'd love to see kernel devs acknowledge the issue, using the same test cases that we're using (either 'ninja build' WebKit or simply 'tail /dev/zero'), and propose a real concrete solution. But I'm not going to hold my breath for that. My understanding is that previous discussions have concluded that the kernel OOM is designed to ensure enough memory remains available to the kernel, and that userspace is responsible for determining how to keep userspace responsive.

Also: what precisely is this even supposed to do? Replace the
algorithm for detecting *when* to go on a kill rampage? Or actually
replace the algorithm selecting *what* to kill during a kill rampage?

earlyoom is restricted to the former, although in the future we might be interested in doing the later as well, either by enhancing earlyoom or switching to another tool, e.g. to prevent sshd or journald from being killed.

If it's the former (which the name of the project suggests,
_early_oom)), then at the most basic the tool should let the kernel do
the killing, i.e. "echo f > /proc/sysrq-trigger". That way the
reporting via cgroups isn't fucked, and systemd can still do its
thing, and the kernel can kill per cgroup rather than per process...

Problem is that letting the kernel do the work can cause data loss. earlyoom needs to handle process termination itself so that it can send SIGTERM first, instead of jumping straight to SIGKILL and corrupting who knows what.

Michael

_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Fedora Announce]     [Fedora Users]     [Fedora Kernel]     [Fedora Testing]     [Fedora Formulas]     [Fedora PHP Devel]     [Kernel Development]     [Fedora Legacy]     [Fedora Maintainers]     [Fedora Desktop]     [PAM]     [Red Hat Development]     [Gimp]     [Yosemite News]

  Powered by Linux