On Mon, Jan 6, 2020 at 3:08 AM Lennart Poettering <mzerqung@xxxxxxxxxxx> wrote:

> Looking at the sources very superficially I see a couple of problems:
>
> 1. Waking up all the time in 100ms intervals? We generally try to
> avoid waking the CPU up all the time if nothing happens. Saving
> power and things.

I agree. What do you think is a reasonable interval? Given that earlyoom won't SIGTERM until both free memory and free swap drop below 10%, and that will take at least some seconds, what about an interval of 3 seconds?

> But more importantly: are we sure this actually operates the way we
> should? i.e. PSI is really what should be watched. It is not
> interesting who uses how much memory and triggering kills on
> that. What matters is to detect when the system becomes slow due to
> that, i.e. *latencies* introduced due to memory pressure and that's
> what PSI is about, and hence what should be used.

Earlyoom is a short-term stopgap while a more sophisticated solution is still maturing: low-memory-monitor, which does leverage PSI.

> But even if we'd ignore that in order to fight latencies one should watch
> latencies: OOM killing per process is just not appropriate on a
> systemd system: all our system services (and a good chunk of our user
> services too) are sorted neatly into cgroups, and we really should
> kill them as a whole and not just individual processes inside
> them. systemd manages that today, and makes exceptions configurable
> via OOMPolicy=, and with your earlyoom stuff you break that.

OOMPolicy= depends on the kernel oom-killer, which is extremely reluctant to trigger at all. In my testing, the kernel oom-killer routinely takes more than 30 minutes to trigger. And it may not even kill the worst offender, but rather something like sshd. A couple of times I've seen it kill systemd-journald. That's not a small problem.

earlyoom first sends SIGTERM.
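To put numbers on that trigger condition ("both free memory and free swap below 10%", computed from /proc/meminfo): here's a minimal sketch of the computation in Python, not earlyoom's actual C source, parsing a captured meminfo snippet with made-up values rather than the live file.

```python
# Hypothetical sketch of an earlyoom-style free-percentage check.
# Field names come from /proc/meminfo; the numbers are invented.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemAvailable:     640000 kB
SwapTotal:       4000000 kB
SwapFree:         200000 kB
"""

def meminfo_fields(text):
    """Parse 'Name:  value kB' lines into a dict of ints (kB)."""
    fields = {}
    for line in text.splitlines():
        name, rest = line.split(":", 1)
        fields[name] = int(rest.split()[0])
    return fields

def free_percentages(fields):
    """Return (free memory %, free swap %) from parsed meminfo fields."""
    mem_pct = 100.0 * fields["MemAvailable"] / fields["MemTotal"]
    swap_pct = 100.0 * fields["SwapFree"] / fields["SwapTotal"]
    return mem_pct, swap_pct

mem_pct, swap_pct = free_percentages(meminfo_fields(SAMPLE_MEMINFO))
# Only when BOTH are below 10% would a SIGTERM be sent.
low = mem_pct < 10.0 and swap_pct < 10.0
print(mem_pct, swap_pct, low)  # 8.0 5.0 True
```

Polling this every few seconds instead of every 100ms is cheap, since the values can't plausibly cross both thresholds faster than that.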
It's no different from the user saying "enough of this" and gracefully quitting the offending process. Only if the problem continues to get worse is SIGKILL sent.

> This looks like second guessing the kernel memory management folks at
> a place where one can only lose, and at the same time breaking correct
> OOM reporting by the kernel via cgroups and stuff.

It is intended to be a substitute for the user hitting the power button. It's not intended as a substitute for the OS, as a whole, improving its user advocacy to do the right thing in the first place, which it currently doesn't.

For now, kernel developers have made it clear they do not care about user space responsiveness. At all. Their concern with the kernel oom-killer is strictly keeping the kernel functioning. The congestion that results from heavy simultaneous page-in and page-out also appears not to concern kernel developers: it's a well-known problem, and they haven't made any breakthrough in this area. So it's really going to need to be managed from user space, leveraging PSI and cgroupv2. And that's the next step.

> Also: what precisely is this even supposed to do? Replace the
> algorithm for detecting *when* to go on a kill rampage? Or actually
> replace the algorithm selecting *what* to kill during a kill rampage?

a. It's never a kill rampage.

b. When: it first sends SIGTERM at 10% remaining for both memory and swap, and SIGKILL at 5%. In hundreds of tests I've never seen earlyoom use SIGKILL; so far everything responds fairly immediately to SIGTERM. But I'm also testing with well-behaved programs, nothing malicious, and that's intentional. This problem would actually be far worse with something malicious.

c. What: same as the kernel oom-killer, it uses oom_score. It isn't replacing anything. It's acting as a user advocate by approximating what a reasonable user would do: send SIGTERM. The user can't do this themselves because during heavy swap, system responsiveness is already lost before we're even close to OOM.
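The when/what behavior described above can be sketched like this. This is a rough approximation of the stated thresholds (SIGTERM below 10%, SIGKILL below 5%) and the oom_score-based selection, not earlyoom's implementation; the pid/score table is invented.

```python
import signal

# Sketch of the described escalation, not earlyoom code.
TERM_PCT = 10.0  # both free percentages below this -> SIGTERM
KILL_PCT = 5.0   # both free percentages below this -> SIGKILL

def choose_signal(mem_free_pct, swap_free_pct):
    """Return the signal to send at this pressure level, or None."""
    worst = max(mem_free_pct, swap_free_pct)  # both must be below threshold
    if worst < KILL_PCT:
        return signal.SIGKILL
    if worst < TERM_PCT:
        return signal.SIGTERM
    return None

def choose_victim(oom_scores):
    """Pick the pid with the highest oom_score, as the kernel would."""
    return max(oom_scores, key=oom_scores.get)

# Invented example: pid 4321 has the highest oom_score, so at 9% free
# memory and 8% free swap it would receive the first, graceful SIGTERM.
scores = {1234: 120, 4321: 870, 999: 33}
print(choose_signal(9.0, 8.0))  # SIGTERM, not SIGKILL
print(choose_victim(scores))    # 4321
```

The point is the ordering: the graceful signal always comes first, and SIGKILL is only reached if the situation keeps degrading through another 5% of both resources.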
You're right, someone should absolutely solve the responsiveness problem. Kernel folks have clearly ceded it. Can it be done with cgroupv2 and PSI alone? Unclear.

> If it's the former (which the name of the project suggests,
> _early_oom), then at the most basic the tool should let the kernel do
> the killing, i.e. "echo f > /proc/sysrq-trigger". That way the
> reporting via cgroups isn't fucked, and systemd can still do its
> thing, and the kernel can kill per cgroup rather than per process...

That would be a kill rampage. sysrq+f issues SIGKILL and always results in data loss. Earlyoom sends SIGTERM first, which is a much more conservative first attempt.

> Anyway, this all sounds very very fishy to me. Not thought to the end,
> and I am pretty sure this is something the kernel memory management
> folks should give a blessing to. Second guessing the kernel like that
> is just a bad idea if you ask me.

There's no first or second guessing. The kernel oom-killer is strictly responsible for maintaining enough resources for the kernel, not for system responsiveness. The idea of user space OOM management is to take user space priorities into account, something kernel folks have rather intentionally stayed out of.

> I mean, yes, the OOM killer might not be that great currently, but
> this sounds like something to fix in kernel land, and if that doesn't
> work out for some reason because kernel devs can't agree, then do it
> as fallback in userspace, but with sound input from the kernel folks,
> and the blessing of at least some of the kernel folks.

The kernel oom-killer works exactly as intended and designed. It does not give either one or two shits about user space. It cares only about proper kernel function, and to that end it's working 100% effectively, near as I can tell. The mistake most people make is assuming the kernel oom-killer is intended as a user space, let alone end user, advocate.
That is what earlyoom and other user space oom managers are trying to do.

I do rather like your idea from some months ago about moving to systems that have a much smaller swap during normal use, only creating and activating a large swap at hibernation time, and deactivating it again after resuming. That way, during normal operation only "incidental" swap is allowed, and heavy swapping very quickly has nowhere to go. And tied to that, a way to restrict the resources unprivileged processes get, rather than allowing them to overcommit, which is something low-memory-monitor attempts to achieve.

-- 
Chris Murphy
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx