Memory reclaim protection and cgroup nesting (desktop use)

Benjamin Berg <benjamin@xxxxxxxxxxxxxxxx> · Wed, 04 Mar 2020 10:44:44 +0100

Hi,

TL;DR: I seem to need memory.min/memory.low to be set on each child
cgroup and not just the parents. Is this expected?

I have been experimenting with using cgroups to protect a GNOME
session. The intention is that GNOME Shell itself and other important
services remain responsive even if the application workload is
thrashing. The long-term goal is to bridge the time until an OOM
killer like oomd gets the system back into normal conditions using
memory pressure information.

Note that I have done these tests without any swap and with huge
memory.min/memory.low values. I consider this scenario pathological;
however, it seems like a reasonable way to really exercise the cgroup
reclaim protection logic.

The resulting cgroup hierarchy looked something like:

-.slice
├─user.slice
│ └─user-1000.slice
│   ├─user@1000.service
│   │ ├─session.slice
│   │ │ ├─gsd-*.service
│   │ │ │ └─208803 /usr/libexec/gsd-rfkill
│   │ │ ├─gnome-shell-wayland.service
│   │ │ │ ├─208493 /usr/bin/gnome-shell
│   │ │ │ ├─208549 /usr/bin/Xwayland :0 -rootless -noreset -accessx -core -auth /run/user/1000/.mutter-Xwayla>
│   │ │ │ └─ …
│   │ └─apps.slice
│   │   ├─gnome-launched-tracker-miner-fs.desktop-208880.scope
│   │   │ └─208880 /usr/libexec/tracker-miner-fs
│   │   ├─dbus-:1.2-org.gnome.OnlineAccounts@0.service
│   │   │ └─208668 /usr/libexec/goa-daemon
│   │   ├─flatpak-org.gnome.Fractal-210350.scope
│   │   ├─gnome-terminal-server.service
│   │   │ ├─209261 /usr/libexec/gnome-terminal-server
│   │   │ ├─209434 bash
│   │   │ └─ … including the test load i.e. "make -j32" of a C++ code

I also enabled the CPU and IO controllers in my tests, but I don't
think that is as relevant. The main thing is that I set
  memory.min: 2GiB
  memory.low: 4GiB

using systemd on all of

 * user.slice,
 * user-1000.slice,
 * user@1000.service,
 * session.slice and
 * everything inside session.slice
   (i.e. gnome-shell-wayland.service, gsd-*.service, …)

excluding apps.slice from protection.
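
For readers who want to reproduce the setup, the settings listed above could be applied roughly as follows. This is only a sketch under assumptions: it uses `systemctl set-property` with systemd's MemoryMin= and MemoryLow= properties, it guesses which units belong to the system manager and which to the user manager, and the helper name build_commands is made up for illustration. It only prints the commands and does not run them. The gsd-*.service units are left out and would need to be listed explicitly.

```python
# Sketch: print systemctl commands that would set the protection values
# described above. Nothing is executed; review before running anything.

MEMORY_MIN = "2G"  # memory.min: 2GiB (systemd parses "G" with base 1024)
MEMORY_LOW = "4G"  # memory.low: 4GiB

# ASSUMPTION: these units are managed by the system manager (PID 1).
SYSTEM_UNITS = ["user.slice", "user-1000.slice", "user@1000.service"]
# ASSUMPTION: these units are managed by the per-user manager.
# The gsd-*.service units are omitted; they would have to be listed too.
USER_UNITS = ["session.slice", "gnome-shell-wayland.service"]


def build_commands(system_units=SYSTEM_UNITS, user_units=USER_UNITS):
    """Return one argv list per unit; apps.slice is deliberately left out."""
    props = [f"MemoryMin={MEMORY_MIN}", f"MemoryLow={MEMORY_LOW}"]
    commands = []
    for unit in system_units:
        commands.append(
            ["systemctl", "set-property", "--runtime", unit, *props])
    for unit in user_units:
        commands.append(
            ["systemctl", "--user", "set-property", "--runtime", unit, *props])
    return commands


def main():
    for argv in build_commands():
        print(" ".join(argv))


if __name__ == "__main__":
    main()
```

The --runtime flag keeps the change from persisting across reboots, which seems appropriate for an experiment like this one.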

(In a realistic scenario I expect to have swap and then reserve maybe
a few hundred MiB; DAMON might help with finding good values.)

At that point, the protection started working pretty much flawlessly,
i.e. my gnome-shell would continue to run without major page faulting
even though everything in apps.slice was thrashing heavily. The
mouse/keyboard remained completely responsive, and interacting with
applications ended up working much better thanks to knowing where input
was going, even if the applications themselves took seconds to react.

So far, so good. What surprises me is that I needed to set the
protection on the child cgroups (i.e. gnome-shell-wayland.service).
Without this, it would not work (reliably) and my gnome-shell would
still have a lot of re-faults to load libraries and other mmap'ed data
back into memory. (I used "perf trace --no-syscalls -F" to trace this
and observed the faults occurring repeatedly for the same pages, e.g.
to load functions for execution.)
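
A cruder way to observe the same effect is to watch the major-fault counter of the shell process. The sketch below is not the perf-based tracing described above; it swaps in procfs polling instead, and it cannot show which pages are faulting. It assumes Linux procfs, where field 12 of /proc/<pid>/stat is majflt. The helper names and the ten-sample limit are made up for illustration.

```python
# Sketch: poll the major-fault counter (majflt) of a process via procfs.
import sys
import time


def parse_majflt(stat_line):
    """Extract majflt (field 12) from the contents of /proc/<pid>/stat."""
    # The comm field (field 2) is parenthesised and may itself contain
    # spaces or parentheses, so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 12 is rest[9].
    return int(rest[9])


def read_majflt(pid):
    """Read the current cumulative major-fault count for a PID."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_majflt(f.read())


def main():
    if len(sys.argv) != 2:
        print("usage: majflt.py PID")
        return
    pid = int(sys.argv[1])
    previous = read_majflt(pid)
    for _ in range(10):  # ASSUMPTION: ten one-second samples are enough
        time.sleep(1)
        current = read_majflt(pid)
        print(f"major faults in the last second: {current - previous}")
        previous = current


if __name__ == "__main__":
    main()
```

A counter that keeps climbing while the application workload thrashes would be consistent with the shell's pages being reclaimed and faulted back in.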

Due to accounting effects, I would expect each page to re-fault at
most once in this scenario. At that point the page in question will be
accounted against the shell's cgroup and reclaim protection could kick
in. Unfortunately, that did not seem to happen unless the shell's
cgroup itself had protections and not just all of its parents.
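
To check which levels of the hierarchy actually carry protection settings, something like the following could be used. It is a sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and the standard memory.min, memory.low and memory.current files; the default root path and the function names are assumptions for illustration.

```python
# Sketch: list memory protection settings for every cgroup below a root.
import os
import sys


def read_value(path):
    """Return the stripped contents of a cgroup file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def protection_report(root):
    """Yield (relative path, memory.min, memory.low, memory.current)."""
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, root)
        yield (
            rel,
            read_value(os.path.join(dirpath, "memory.min")),
            read_value(os.path.join(dirpath, "memory.low")),
            read_value(os.path.join(dirpath, "memory.current")),
        )


def main():
    # ASSUMPTION: cgroup v2 is mounted at /sys/fs/cgroup.
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/user.slice"
    for rel, mem_min, mem_low, current in protection_report(root):
        print(f"{rel}: min={mem_min} low={mem_low} current={current}")


if __name__ == "__main__":
    main()
```

Comparing the output with and without the per-child settings would show whether a cgroup such as gnome-shell-wayland.service has its own memory.min/memory.low or only inherits from configured parents.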

Is it expected that I need to set the protection on each child?

Benjamin