On Sat 05-08-23 07:55:56, Chuyi Zhou wrote:
> Hello,
>
> On 2023/8/4 19:34, Alan Maguire wrote:
[...]
> > I don't know anything about OOM mechanisms, so maybe it's just me, but I
> > found this confusing. Relying on the previous iteration to control
> > current iteration behaviour seems risky - even if BPF found a victim in
> > iteration N, it's no guarantee it will in iteration N+1.
> >
> The current kernel's OOM actually works like this:
>
> 1. if we first find a valid candidate victim A in iteration N, we would
> record it in oc->chosen.
>
> 2. In iteration N + 1, N+2..., we just compare oc->chosen with the current
> iterating task. Suppose we think current task B is better than
> oc->chosen(A), we would set oc->chosen = B and we would not consider A
> anymore.
>
> IIUC, most policy works like this. We just need to find the *most* suitable
> victim. Normally, if in current iteration we drop A and select B, we would
> not consider A anymore.

Yes, we iterate over all tasks in the specific oom domain (all tasks for
global and all members of memcg tree for hard limit oom). The in-tree
oom policy has to iterate all tasks to achieve some of its goals (like
preventing overkilling while the previously selected victim is still on
the way out). Also oom_score_adj might change the final decision so you
have to really check all eligible tasks.

I can imagine a BPF based policy could be less constrained and, as Roman
suggested, have pre-selected victims on standby. I do not see a problem
with having a break-like mode. Similar to the current abort, but without
canceling an already noted victim.
-- 
Michal Hocko
SUSE Labs
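
For illustration, the selection scheme discussed in this message can be sketched as a toy userspace model. This is not kernel code, and every name in it (Task, badness, select_victim, stop_early) is hypothetical. The scoring is an assumption made for the example, not the in-tree heuristic. The sketch shows a full scan that keeps the best candidate so far, an abort that drops the noted victim when an earlier victim is still exiting, and a break-like mode that stops early but keeps the noted victim.

```python
# Toy model of "keep the best candidate so far" victim selection.
# All names here are hypothetical; this is not the kernel implementation.
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Task:
    name: str
    mem_pages: int        # stand-in for the task's memory footprint
    oom_score_adj: int    # stand-in for the per-task adjustment
    eligible: bool = True
    dying: bool = False   # a previously selected victim still on its way out


def badness(task: Task) -> int:
    """Assumed scoring for the example: footprint plus adjustment."""
    return task.mem_pages + task.oom_score_adj


def select_victim(
    tasks: Iterable[Task],
    stop_early: Optional[Callable[[Task], bool]] = None,
) -> Optional[Task]:
    """Scan the oom domain and return the chosen victim, or None.

    stop_early models a break-like mode: when it returns True for the
    current choice, the scan ends and the noted victim is kept.
    """
    chosen: Optional[Task] = None
    for task in tasks:
        if task.dying:
            # Abort: an earlier victim is still exiting, so discard any
            # noted victim to avoid killing more than necessary.
            return None
        if not task.eligible:
            continue
        # Replace the noted victim only when the current task scores worse;
        # the dropped candidate is never reconsidered.
        if chosen is None or badness(task) > badness(chosen):
            chosen = task
        if stop_early is not None and chosen is not None and stop_early(chosen):
            break  # break-like mode: stop scanning, keep the noted victim
    return chosen
```

With no stop_early callback the model visits every eligible task, which is why an adjustment on a task late in the list can still change the result. A policy with pre-selected victims would pass a callback that recognizes one of them and ends the scan there.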