Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select_task

Michal Hocko <mhocko@xxxxxxxx> · Tue, 8 Aug 2023 10:18:39 +0200

On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > > 
> > > 
> > > 在 2023/8/4 21:34, Michal Hocko 写道:
> > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > [...]
> > > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > > +		case 0: goto next; /* ignore process */
> > > > > > +		default: goto select; /* note the task */
> > > > > > +	}
> 
> To be honest, I can't say I like it. IMO it's not really using the full bpf
> potential and is too attached to the current oom implementation.

TBH I am not sure we are able to come up with an interface that would
ise the full BPF potential at this stage and I strongly believe that we
should start by something that is good enough.

> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> which will be executed potentially thousands times under a very heavy memory
> pressure. We will need to make sure that they don't allocate (much) memory, don't
> take any locks which might deadlock with other memory allocations etc.
> It will potentially require hard restrictions on what these programs can and can't
> do and this is something that the bpf community will have to maintain long-term.

Right, BPF callbacks operating under OOM situations will be really
constrained but this is more or less by definition. Isn't it?

> Second, if we're introducing bpf here (which I'm not yet convinced),
> IMO we should use it in a more generic and expressive way.
> Instead of adding hooks into the existing oom killer implementation, we can call
> a bpf program before invoking the in-kernel oom killer and let it do whatever
> it takes to free some memory. E.g. we can provide it with an API to kill individual
> tasks as well as all tasks in a cgroup.
> This approach is more generic and will allow to solve certain problems which
> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> instead of killing tasks.

The aim of this proposal is to lift any heavy lifting steming from
iterating tasks or cgroups which those BPF might need to make a
decision. There are other ways of course and provide this iteration
functionality as library functions but my BPF experience is very limited
to say how easy is that.

> So I think the alternative approach is to provide some sort of an interface to
> pre-select oom victims in advance. E.g. on memcg level it can look like:
> 
> echo PID >> memory.oom.victim_proc

this is just a terrible interface TBH. Pids are very volatile objects.
At the time oom killer reads this pid it might be a completely different
process.

> If the list is empty, the default oom killer is invoked.
> If there are tasks, the first one is killed on OOM.
> A similar interface can exist to choose between sibling cgroups:
> 
> echo CGROUP_NAME >> memory.oom.victim_cgroup

Slightly less volatile but not much better either.

> This is just a rough idea.

I am pretty sure that both policies could be implemetd by the proposed
BPF interface though if you want something like that.
-- 
Michal Hocko
SUSE Labs