On Fri 08-07-22 17:25:31, Gang Li wrote:
> Oh apologize. I just realized what you mean.
>
> I should try a "cpuset cgroup oom killer" selecting victim from a
> specific cpuset cgroup.

Yes, that was the idea. Many workloads which really do care about
partitioning the NUMA system tend to use cpusets. In those cases you
have reasonably defined boundaries and the current OOM killer
implementation is not really aware of that. The OOM selection process
could be enhanced/fixed to select victims from those cpusets, similar to
how memcg OOM killer victim selection is done. There is no additional
accounting required for this approach because the workload is
partitioned on the cgroup level already. Maybe this is not really the
best fit for all workloads, but it should be reasonably simple to
implement without intrusive or runtime-visible changes.

I am not saying per-NUMA accounting is wrong or a bad idea. I would just
like to see a stronger justification for that and also some arguments
why a simpler approach via cpusets is not viable. Does this make sense
to you?
-- 
Michal Hocko
SUSE Labs