On Tue 11-04-23 14:58:15, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
> has never been utilized.
>
> When a process in a cpuset which constrains memory placement triggers
> oom, it may kill a completely irrelevant process on other NUMA nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node-aware oom by using `CONSTRAINT_CPUSET` and
> selecting the victim from cpusets with the same mems_allowed as the
> current one.

I believe it still wouldn't hurt to be more specific here.
CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
spin.

	/* Check this allocation failure is caused by cpuset's wall function */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;

Does this even work properly, and why? prepare_alloc_pages sets
oc->nodemask to current->mems_allowed, but the above gives us
cpuset_limited only if there is at least one zone/node that is not
oc->nodemask compatible. So it seems like this wouldn't ever get set
unless oc->nodemask got reset somewhere. This is a maze indeed.

Is there any reason why we cannot rely on __GFP_HARDWALL here? Or should
we instead rely on the fact that the nodemask should be the same as
current->mems_allowed?

I do realize that this is not directly related to your patch, but
considering this has been mostly doing nothing, maybe we want to
document it better or even rework it on this occasion.

> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1, constrained
> by cpusets respectively. These two processes alloc memory on their
> own node. Now node0 has run out of memory, and OOM will be invoked by
> mem_on_node0.

Don't you have an actual real-life example with a properly partitioned
system which clearly misbehaves and this patch addresses that?
-- 
Michal Hocko
SUSE Labs