Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY

Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> · Mon, 31 Jul 2023 14:00:22 +0800

Hello, Michal

On 2023/7/28 01:23, Michal Hocko wrote:
On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:

On 2023/7/27 16:15, Michal Hocko wrote:
On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
This patchset tries to add a new bpf prog type and use it to select
a victim memcg when global OOM is invoked. The main motivation is
the need for customizable OOM victim selection, so that we can
protect more important apps from the OOM killer.

This is rather modest to give an idea how the whole thing is supposed to
work. I have looked through patches very quickly but there is no overall
design described anywhere either.

Please could you give us a high level design description and reasoning
why certain decisions have been made? e.g. why is this limited to the
global oom situation, why is the BPF program forced to operate on memcgs
as entities etc...
Also it would be very helpful to call out limitations of the BPF
program, if there are any.

Thanks!

Hi,

Thanks for your advice.

The global/memcg OOM victim selection uses the process as its base search
granularity. However, we can see a need for cgroup-level protection, and
there has been some discussion[1]. It seems reasonable to consider using the
memcg as a search granularity in the victim selection algorithm.

Yes, it can be reasonable for some policies but making it central to the
design is very limiting.

Besides, this seems a good fit for offloading policy decisions to a BPF
program, since BPF is scalable and flexible. That's why the new BPF
program operates on memcgs as entities.

I do not follow your line of argumentation here. The same could be
argued for processes or beans.

The idea is to let the user choose which leaf in the memcg tree should be
selected as the victim. At the first layer, if we choose A, the memcgs
under the B, C, and D subtrees are protected.

          root
       /  |  \  \
      A   B   C   D
     / \
    E   F

Using the BPF prog, we are allowed to compare the OOM priority between
two siblings so that we can choose the best victim in each layer.

How is the priority defined and communicated to userspace?

For example:

run_prog(B, C) -> choose B
run_prog(B, D) -> choose D
run_prog(A, D) -> choose A

Once we select A as the victim in the first layer, the victim in next layer
would be selected among A's children. Finally, we select a leaf memcg as
victim.
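The layer-by-layer walk above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct memcg_node, its oom_priority field, and select_victim_leaf() are hypothetical names, not kernel structures, and run_prog() here is a stand-in for the BPF program that simply treats the lower priority value as the one to kill first.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct memcg_node {
	const char *name;
	int oom_priority;		/* assumption: lower value is killed first */
	struct memcg_node **children;	/* NULL for a leaf */
	size_t nr_children;
};

/*
 * Stand-in for the attached BPF program: compare two siblings and
 * return the one that should be the victim. Ties keep the first one.
 */
static struct memcg_node *run_prog(struct memcg_node *a, struct memcg_node *b)
{
	return (b->oom_priority < a->oom_priority) ? b : a;
}

/*
 * Walk down from root. In each layer, fold run_prog() over the siblings
 * to pick one victim, then descend into it until a leaf is reached.
 * A root without children is returned as-is.
 */
static struct memcg_node *select_victim_leaf(struct memcg_node *root)
{
	struct memcg_node *cur = root;

	while (cur->nr_children > 0) {
		struct memcg_node *victim = cur->children[0];

		for (size_t i = 1; i < cur->nr_children; i++)
			victim = run_prog(victim, cur->children[i]);
		cur = victim;
	}
	return cur;
}
```

With priorities chosen so that A loses every comparison in the first layer, the walk descends into A and then compares only E and F; nothing under B, C, or D is ever visited.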

This sounds like a very specific oom policy and that is fine. But the
interface shouldn't be bound to any concepts like priorities let alone
be bound to memcg based selection. Ideally the BPF program should get
the oom_control as an input and either get a hook to kill process or if
that is not possible then return an entity to kill (either process or
set of processes).

Here are two interfaces I can think of. I was wondering if you could 
give me some feedback.

1. Add a new hook in select_bad_process(). A BPF program attached to it
returns a set of pids or cgroup_ids pre-selected by a user-defined policy,
as suggested by Roman. We could then use oom_evaluate_task() to find the
final victim among them. It's user-friendly, and it lets us offload the OOM
policy to userspace.

2. Add a new hook in oom_evaluate_task() that returns a score overriding
the default oom_badness() return value. The simplest use is to protect
certain processes by giving them the minimum score.
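To make the difference between the two options concrete, here is a rough userspace model, not kernel code. Everything in it is an assumption made for illustration: struct task_info, the two callback types, pick_victim(), and the example policies are hypothetical, the badness field stands in for the default oom_badness() result, and LONG_MIN is used in this sketch as the "never kill" score.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record; badness models the default score. */
struct task_info {
	int pid;
	long badness;
};

/* Option 1: the policy pre-selects which tasks are candidates at all. */
typedef bool (*preselect_fn)(const struct task_info *t);

/* Option 2: the policy may override the default score of each task. */
typedef long (*score_fn)(const struct task_info *t, long default_points);

/*
 * Pick the task with the highest score. Either callback may be NULL,
 * in which case the default behaviour (all tasks, default score) applies.
 * A score of LONG_MIN marks the task as protected in this sketch.
 */
static const struct task_info *pick_victim(const struct task_info *tasks,
					   size_t n, preselect_fn preselect,
					   score_fn override)
{
	const struct task_info *chosen = NULL;
	long chosen_points = LONG_MIN;

	for (size_t i = 0; i < n; i++) {
		long points = tasks[i].badness;

		if (preselect && !preselect(&tasks[i]))
			continue;
		if (override)
			points = override(&tasks[i], points);
		if (points == LONG_MIN || points < chosen_points)
			continue;
		chosen = &tasks[i];
		chosen_points = points;
	}
	return chosen;
}

/* Example policy for option 2: protect pid 100 with the minimum score. */
static long protect_pid_100(const struct task_info *t, long default_points)
{
	return t->pid == 100 ? LONG_MIN : default_points;
}

/* Example policy for option 1: only pid 200 is a candidate. */
static bool only_pid_200(const struct task_info *t)
{
	return t->pid == 200;
}
```

Option 1 narrows the candidate set and leaves scoring untouched, while option 2 keeps every task as a candidate and changes only the score, so the two can be combined or used separately.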

Of course if you have a better idea, please let me know.

Thanks!
---
Chuyi Zhou

In our scenarios, the impact of global OOMs is much more common, so
we only considered the global case in this patchset. But it seems that the
idea can also be applied to memcg OOM.

The global and memcg OOMs shouldn't have a different interface. If a
specific BPF program wants to implement a different policy for global
vs. memcg OOM then so be it, but this should be a decision of the said
program, not an inherent limitation of the interface.
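One way to read this point, sketched as a small userspace model: the policy receives the same context for both kinds of OOM and branches on it by itself. The names struct oom_ctx, enum oom_policy_kind, and choose_policy() are hypothetical; the only assumption carried over from the kernel is the convention that a NULL memcg in the OOM context means a global OOM.

```c
#include <stdbool.h>
#include <stddef.h>

struct memcg;	/* opaque, hypothetical stand-in for a memcg */

/* Hypothetical context handed to the policy for every OOM event. */
struct oom_ctx {
	const struct memcg *memcg;	/* NULL means global OOM (assumption) */
};

enum oom_policy_kind {
	POLICY_GLOBAL,
	POLICY_MEMCG,
};

static bool is_memcg_oom(const struct oom_ctx *ctx)
{
	return ctx->memcg != NULL;
}

/*
 * The interface is identical for both cases; choosing a different
 * policy for global vs. memcg OOM is left to the program itself.
 */
static enum oom_policy_kind choose_policy(const struct oom_ctx *ctx)
{
	return is_memcg_oom(ctx) ? POLICY_MEMCG : POLICY_GLOBAL;
}
```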

[1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@xxxxxxxxxxxxxx/

Thanks!
--
Chuyi Zhou