On Thu, Aug 17, 2023 at 8:30 PM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
>
> Hello,
> On 2023/8/17 11:22, Alexei Starovoitov wrote:
> > On Wed, Aug 16, 2023 at 7:51 PM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >> On 2023/8/17 10:07, Alexei Starovoitov wrote:
> >>> On Thu, Aug 10, 2023 at 1:13 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
> >>>>  static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>>>  {
> >>>>         struct oom_control *oc = arg;
> >>>> @@ -317,6 +339,26 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
> >>>>         if (!is_memcg_oom(oc) && !oom_cpuset_eligible(task, oc))
> >>>>                 goto next;
> >>>>
> >>>> +       /*
> >>>> +        * If task is allocating a lot of memory and has been marked to be
> >>>> +        * killed first if it triggers an oom, then select it.
> >>>> +        */
> >>>> +       if (oom_task_origin(task)) {
> >>>> +               points = LONG_MAX;
> >>>> +               goto select;
> >>>> +       }
> >>>> +
> >>>> +       switch (bpf_oom_evaluate_task(task, oc)) {
> >>>> +       case BPF_EVAL_ABORT:
> >>>> +               goto abort; /* abort search process */
> >>>> +       case BPF_EVAL_NEXT:
> >>>> +               goto next; /* ignore the task */
> >>>> +       case BPF_EVAL_SELECT:
> >>>> +               goto select; /* select the task */
> >>>> +       default:
> >>>> +               break; /* No BPF policy */
> >>>> +       }
> >>>> +
> >>>
> >>> I think forcing bpf prog to look at every task is going to be limiting
> >>> long term.
> >>> It's more flexible to invoke bpf prog from out_of_memory()
> >>> and if it doesn't choose a task then fall back to select_bad_process().
> >>> I believe that's what Roman was proposing.
> >>> bpf can choose to iterate memcg or it might have some side knowledge
> >>> that there are processes that can be set as oc->chosen right away,
> >>> so it can skip the iteration.
> >>
> >> IIUC, we may need some new bpf features if we want to iterate over
> >> tasks/memcgs in BPF, such as:
> >> bpf_for_each_task
> >> bpf_for_each_memcg
> >> bpf_for_each_task_in_memcg
> >> ...
> >>
> >> It seems we have some work to do first on the BPF side.
> >> Will these iterating features be useful in BPF scenarios other than
> >> OOM policy?
> >
> > Yes.
> > Use open coded iterators though.
> > Like the example in
> > https://lore.kernel.org/all/20230810183513.684836-4-davemarchevsky@xxxxxx/
> >
> > bpf_for_each(task_vma, vma, task, 0) { ... }
> > will safely iterate vma-s of the task.
> > Similarly struct css_task_iter can be hidden inside bpf open coded iterator.
> OK. I think the following APIs would be useful and I am willing to
> start with these in another bpf-next RFC patchset:
>
> 1. bpf_for_each(task). Just like for_each_process(p) in the kernel, it
>    iterates over all tasks in the system under rcu_read_lock().
>
> 2. bpf_for_each(css_task, task, css). It works like
>    css_task_iter_{start, next, end} and would be used to iterate over
>    tasks/threads under a css.
>
> 3. bpf_for_each(descendant_css, css, root_css, {PRE, POST}). It works
>    like css_next_descendant_{pre, post} and iterates over all descendants.
>
> If you have better ideas or any advice, please let me know.

Sounds great.
These three new iterators are unrelated to the oom discussion and can be
developed/landed in parallel.
They will be useful in other bpf programs.
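
To make the control flow discussed above concrete, here is a rough user-space
model in plain C. It is only a sketch under stated assumptions, not kernel or
BPF code: struct task, struct oom_ctl, struct task_iter, the `sacrificial`
flag, and every function name below are hypothetical stand-ins invented for
illustration, and the fallback simply picks the largest rss_pages rather than
reproducing the real select_bad_process() heuristics. It shows two things: an
iterator with a new/next/destroy shape similar in spirit to an open coded
iterator, and a policy hook that is called first and may set `chosen`, with a
full scan used only when the policy declines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; NOT the real definitions. */
struct task {
	int pid;
	long rss_pages;
	int sacrificial;	/* made-up flag: policy may kill this first */
};

struct oom_ctl {
	const struct task *chosen;
};

/* Iterator with a new/next/destroy shape, modelled over a plain array. */
struct task_iter {
	const struct task *tasks;
	size_t n;
	size_t pos;
};

static void task_iter_new(struct task_iter *it, const struct task *tasks,
			  size_t n)
{
	assert(it != NULL);
	it->tasks = tasks;
	it->n = n;
	it->pos = 0;
}

static const struct task *task_iter_next(struct task_iter *it)
{
	assert(it != NULL);
	return it->pos < it->n ? &it->tasks[it->pos++] : NULL;
}

static void task_iter_destroy(struct task_iter *it)
{
	it->tasks = NULL;
	it->n = 0;
	it->pos = 0;
}

/* A policy returns nonzero if it set oc->chosen, zero to decline. */
typedef int (*oom_policy_fn)(struct oom_ctl *oc, const struct task *tasks,
			     size_t n);

/* Example policy: pick the first task marked sacrificial, if any. */
static int policy_pick_sacrificial(struct oom_ctl *oc,
				   const struct task *tasks, size_t n)
{
	struct task_iter it;
	const struct task *t;

	task_iter_new(&it, tasks, n);
	while ((t = task_iter_next(&it)) != NULL) {
		if (t->sacrificial) {
			oc->chosen = t;
			break;	/* policy is free to stop early */
		}
	}
	task_iter_destroy(&it);
	return oc->chosen != NULL;
}

/* Simplified fallback: choose the task with the largest rss_pages. */
static void fallback_select(struct oom_ctl *oc, const struct task *tasks,
			    size_t n)
{
	struct task_iter it;
	const struct task *t;

	task_iter_new(&it, tasks, n);
	while ((t = task_iter_next(&it)) != NULL) {
		if (oc->chosen == NULL || t->rss_pages > oc->chosen->rss_pages)
			oc->chosen = t;
	}
	task_iter_destroy(&it);
}

/* Policy first; fall back to the full scan only if it chose nothing. */
static const struct task *model_out_of_memory(oom_policy_fn policy,
					      const struct task *tasks,
					      size_t n)
{
	struct oom_ctl oc = { .chosen = NULL };

	if (policy == NULL || !policy(&oc, tasks, n))
		fallback_select(&oc, tasks, n);
	return oc.chosen;
}
```

Calling model_out_of_memory(policy_pick_sacrificial, tasks, n) returns the
marked task when one exists; passing NULL as the policy, or a policy that
declines, ends up in the fallback scan. The point of the shape is that the
policy decides how much to iterate, instead of being called once per task.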