Re: Fedora 34 Change: Enable systemd-oomd by default for all variants (System-Wide Change)

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Tue, 22 Dec 2020 12:53:48 -0700

On Tue, Dec 22, 2020 at 11:42 AM Robbie Harwood <rharwood@xxxxxxxxxx> wrote:
>
>
> I believe you are assuming the consequent when you suggest that kernel
> developers should be somehow fixing this in userspace.
>
> To back up: the described problem is the manifestation of an interaction
> between swap and the OOM condition.  The OOM killer is a
> popularly-understood piece of what goes on in the system during OOM, but
> it's not like the rest of the kernel can be ignored.  (I would argue
> that part of the reason it's well understood is their insistence that it
> remain simple, but that's getting off into the weeds.)

No, the problem happens any time a resource becomes constrained: CPU,
memory, IO. It's not exclusively a swap problem.

When swap pressure is part of the problem, it depends on how swap is
being used. Heavy page-out IO is a good thing. Heavy page-in and
page-out of the same pages is a bad thing: that's thrashing.
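
To make that distinction a bit more concrete, here is a rough sketch
that compares two samples of the pswpin and pswpout counters from
/proc/vmstat. The function names and both thresholds (min_pages,
thrash_ratio) are made up for illustration, and the counters can't
tell you whether it's the *same* pages going in and out, so treat the
result as a hint and not a diagnosis.

```python
def parse_vmstat(text):
    """Pull the pswpin/pswpout counters out of /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def classify_swap_activity(before, after, min_pages=1000, thrash_ratio=0.5):
    """Classify swap traffic between two samples.

    min_pages and thrash_ratio are illustrative assumptions, not
    values taken from the kernel or from systemd-oomd.
    """
    pages_in = after["pswpin"] - before["pswpin"]
    pages_out = after["pswpout"] - before["pswpout"]
    if pages_in + pages_out < min_pages:
        return "quiet"
    # Lots of traffic in both directions at once is the suspicious case.
    if min(pages_in, pages_out) >= thrash_ratio * max(pages_in, pages_out):
        return "possible thrashing"
    return "mostly page-out" if pages_out > pages_in else "mostly page-in"
```

On a live system you'd read /proc/vmstat twice, some seconds apart,
and pass both parsed samples in.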

>
> So, several control points here:
>
> - OOM killer behavior.  I think we're in agreement that this isn't the
>   thing that needs changed.

If you mean the kernel OOM killer, yeah, probably. It's generally
considered to be working correctly. Its job is to keep the kernel
functioning and making forward progress, without any regard
whatsoever for user space priorities like system responsiveness.

> - Enabling swap.  Swap is really slow (by virtue of not being RAM...)
>   and we don't seem willing to acknowledge that.  If we want the system
>   to be snappy and responsive... we shouldn't be swapping.

This is not entirely correct. The chosen workload might be excessive
compared to the allocated resources; that is an underprovisioned
system. It does happen, I see it from time to time, but it's not that
common because it results in a lot of pain for the user, so they stop
doing it.

If you aren't swapping at all, that means you have allocated more
resources than the workload requires. You've overprovisioned. This is
apparently quite common in the Kubernetes workflow, because Kubernetes
doesn't work properly with swap, somehow by design. So their view is,
don't create a swap device, just overprovision.

Swap is for evicting anonymous pages, pages that aren't backed by any
kind of file. If inactive anonymous pages can't be swapped, they have
to stay in memory. And when memory is under pressure, the kernel then
has no choice but to reclaim from the only other place it can: pages
that are backed by files, including the code of running programs.
Those get dropped and have to be read back in from disk when they're
next needed, which ends up looking a lot like swap thrashing.
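
A simplified sketch of that accounting, using the LRU fields from
/proc/meminfo. The helper names are mine, and capping evictable
anonymous memory at SwapFree is a deliberate simplification of what
the kernel really does; the point is only that with no swap, the
anonymous column goes to zero and file pages are all that's left.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def reclaim_candidates(info):
    """Return (file_kb, anon_kb) that could plausibly be evicted.

    Assumption for illustration: inactive anonymous pages only count
    as evictable up to the amount of free swap available to hold them.
    """
    file_kb = info["Active(file)"] + info["Inactive(file)"]
    anon_kb = min(info["Inactive(anon)"], info["SwapFree"])
    return file_kb, anon_kb
```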

Another factor is there have been recent improvements in the swap code
to make dirty page eviction much better and avoid swap thrashing.
You'll need a 5.10 kernel for the most recent work on this.

>
> - Swap aggressiveness.  Suggested by above, people want swap anyway.
>   (Sometimes it's for hibernation (not supported, but that stops no
>   one), sometimes it's for... historical reasons?  Underprovisioning?)
>   This could be tuned to the use cases we actually want.

The idea of proper resource control is to use swap more effectively,
to reduce the heavy swap thrashing. It's not a problem to do dirty
page eviction (page out). That frees memory and makes it less likely
other processes will thrash.

>
> - Education.  Get people to a point where admins don't deploy swap on
>   systems that aren't going to hibernate.  I'll readily admit this one
>   might be hardest.

That is bad advice. We do need swap.

https://chrisdown.name/2018/01/02/in-defence-of-swap.html

There's a nice tl;dr at the top and a summary at the bottom. And a
quite detailed explanation in the middle.

> And even possibly the (conceptually) simplest solution of all:
>
> - Swap usage monitoring as described for oomd... but in the kernel.
>   This saves you on all the overhead of running in userspace, if nothing
>   else.

This exists in the form of PSI, as well as cgroup v2:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

memory.swap.current
memory.swap.events
memory.swap.high
memory.swap.max
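
For anyone who wants to poke at these from user space, here is a
minimal sketch of parsers for the three text formats involved: PSI
files such as /proc/pressure/memory, the single-value limit files
(memory.swap.max, memory.swap.high), and flat-keyed files like
memory.swap.events. The function names are hypothetical, and the
parsers work on strings so they don't assume where cgroup v2 is
mounted on your system.

```python
def parse_psi(text):
    """Parse PSI text, e.g. the contents of /proc/pressure/memory."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # "some" or "full"
        values = {}
        for item in fields[1:]:
            key, _, value = item.partition("=")
            # total is a microsecond counter; the avg* fields are percentages
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def parse_limit(text):
    """Parse memory.swap.max or memory.swap.high: a byte count, or 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_flat_keyed(text):
    """Parse a flat-keyed cgroup file such as memory.swap.events."""
    pairs = (line.split() for line in text.splitlines() if line.strip())
    return {key: int(value) for key, value in pairs}
```

A policy daemon would read these periodically and act on, say,
psi["full"]["avg10"] staying above some limit it has chosen; what
that limit should be is exactly the policy part.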

> But what really bothers me here is that, to my knowledge, no one has
> tried to actually make any of these happen in the kernel.  There's a
> vague perception of what "the kernel devs" want, as if they're some
> other, but... has anyone asked?  If so, we should be able to quote what
> the response was, and a good design proposal should include it as an
> explanation of why that route wasn't taken.

I'm not even sure what you're asking for. There is no such thing as a
one-size-fits-all set of policies for resource control. There are
kernel-side components for this, and user space components that use
them to implement a policy.

--
Chris Murphy