On Thu, 18 May 2017, Michal Hocko wrote:

> > See above. OOM kill in a cpuset does not kill an innocent task but a
> > task that does an allocation in that specific context, meaning a task
> > in that cpuset that also has a memory policy.
>
> No, the oom killer will choose the largest task in the specific NUMA
> domain. If you just fail such an allocation then a page fault would get
> VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> the cpusets.

Ok, someone screwed up that code. There still is the determination that
we have a constrained alloc:

oom_kill:

	/*
	 * Check if there were limitations on the allocation (only relevant for
	 * NUMA and memcg) that may require different handling.
	 */
	constraint = constrained_alloc(oc);
	if (constraint != CONSTRAINT_MEMORY_POLICY)
		oc->nodemask = NULL;
	check_panic_on_oom(oc, constraint);

-- Ok. A constrained failing alloc used to terminate the allocating
process here. But it falls through to selecting a "bad process":

	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
	    current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
		get_task_struct(current);
		oc->chosen = current;
		oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
		return true;
	}

-- A constrained allocation should not get here but should fail the
process that attempted the alloc:

	select_bad_process(oc);

Can we restore the old behavior? If I just specify the right memory
policy, I can cause other processes to be terminated?

> > Regardless of that, the point earlier was that the moving logic can
> > avoid creating temporary situations of empty sets of nodes by
> > analysing the memory policies etc. and only performing moves when
> > doing so is safe.
>
> How are you going to do that in a raceless way? Moreover the whole
> discussion is about _failing_ allocations on an empty cpuset and
> mempolicy intersection.
Again, this only works for processes that are well behaved, and it never
worked in a different way before. There was always the assumption that a
process does not allocate in the areas that have allocation constraints,
and that the process does not change memory policies nor store them
somewhere for later, etc.

HPC apps typically allocate memory on startup and then go through long
stretches of processing and I/O. The idea that cpuset node-to-node
migration will work with a running process that does arbitrary activity
is a pipe dream that we should give up. There must be constraints on a
process in order for this to work, and as far as I can tell this is best
done in userspace with a library, and by putting requirements on the
applications that want to be movable that way. F.e. an application that
does not use memory policies or other allocation constraints should be
fine. That has been working.

--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html