On Mon 07-08-23 10:28:17, Roman Gushchin wrote:
> On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> > On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> > >
> > >
> > > On 2023/8/4 21:34, Michal Hocko wrote:
> > > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > > [...]
> > > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > > +		case 0: goto next; /* ignore process */
> > > > > > +		default: goto select; /* note the task */
> > > > > > +	}
>
> To be honest, I can't say I like it. IMO it's not really using the full bpf
> potential and is too attached to the current oom implementation.

TBH I am not sure we are able to come up with an interface that would use
the full BPF potential at this stage, and I strongly believe that we
should start with something that is good enough.

> First, I'm a bit concerned about implicit restrictions we apply to bpf programs
> which will be executed potentially thousands times under a very heavy memory
> pressure. We will need to make sure that they don't allocate (much) memory, don't
> take any locks which might deadlock with other memory allocations etc.
> It will potentially require hard restrictions on what these programs can and can't
> do and this is something that the bpf community will have to maintain long-term.

Right, BPF callbacks operating under OOM situations will be really
constrained, but this is more or less by definition, isn't it?

> Second, if we're introducing bpf here (which I'm not yet convinced),
> IMO we should use it in a more generic and expressive way.
> Instead of adding hooks into the existing oom killer implementation, we can call
> a bpf program before invoking the in-kernel oom killer and let it do whatever
> it takes to free some memory. E.g. we can provide it with an API to kill individual
> tasks as well as all tasks in a cgroup.
> This approach is more generic and will allow to solve certain problems which
> can't be solved by the current oom killer, e.g. deleting files from a tmpfs
> instead of killing tasks.

The aim of this proposal is to take over any heavy lifting stemming from
iterating over tasks or cgroups, which those BPF programs might need in
order to make a decision. There are other ways, of course, such as
providing this iteration functionality as library functions, but my BPF
experience is too limited to say how easy that is.

> So I think the alternative approach is to provide some sort of an interface to
> pre-select oom victims in advance. E.g. on memcg level it can look like:
>
> echo PID >> memory.oom.victim_proc

This is just a terrible interface TBH. Pids are very volatile objects.
By the time the oom killer reads this pid, it might refer to a
completely different process.

> If the list is empty, the default oom killer is invoked.
> If there are tasks, the first one is killed on OOM.
> A similar interface can exist to choose between sibling cgroups:
>
> echo CGROUP_NAME >> memory.oom.victim_cgroup

Slightly less volatile but not much better either.

> This is just a rough idea.

I am pretty sure that both policies could be implemented by the proposed
BPF interface, though, if you want something like that.

-- 
Michal Hocko
SUSE Labs
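
The return-code convention in the quoted hunk can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not the patch itself: `struct fake_task`, `policy_evaluate()` and `select_victim()` are hypothetical names, the policy's verdict is stored in the task for simplicity, and comparing points so that the highest score wins is an assumption about what "note the task" amounts to.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct fake_task {
	int id;
	long default_points;	/* score the built-in heuristic would assign */
	int policy_ret;		/* verdict a (hypothetical) policy hook returns */
	long policy_points;	/* score reported by the policy when it selects */
};

/* Hypothetical hook following the return convention of the quoted switch. */
static int policy_evaluate(const struct fake_task *t, long *points)
{
	if (t->policy_ret > 0)
		*points = t->policy_points;
	return t->policy_ret;
}

/*
 * Walk the candidates once. Returns the chosen task, or NULL when nothing
 * was eligible or the policy asked to abort the search.
 */
static const struct fake_task *select_victim(const struct fake_task *tasks,
					      size_t n)
{
	const struct fake_task *chosen = NULL;
	long chosen_points = 0;

	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct fake_task *t = &tasks[i];
		long points = 0;

		switch (policy_evaluate(t, &points)) {
		case -EOPNOTSUPP:	/* no policy: fall back to default score */
			points = t->default_points;
			break;
		case -EBUSY:		/* abort the whole search */
			return NULL;
		case 0:			/* ignore this task */
			continue;
		default:		/* policy scored the task itself */
			break;
		}

		/* Assumption: the highest score wins on both paths. */
		if (points > chosen_points) {
			chosen = t;
			chosen_points = points;
		}
	}
	return chosen;
}
```

With three candidates where the first has no policy verdict, the second is ignored by the policy, and the third is scored by the policy, `select_victim()` returns whichever of the first and third has the higher score. A single `-EBUSY` verdict anywhere in the walk ends the search with no victim.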