On 2018/11/14 19:16, Michal Hocko wrote: > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote: > [...] > > There is always an invisible lock called "scheduling priority". You can't > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for > > handling the worst case. > > And that problem is all over the memory reclaim. You can get starved > to death and block other resources. And the memory reclaim is not the > only one. I think that it is a manner for kernel developers that no thread keeps consuming CPU resources forever. In the kernel world, doing while (1); is not permitted. Likewise, doing for (i = 0; i < very_large_value; i++) do_something_which_does_not_yield_CPU_to_others(); has to be avoided, in order to avoid lockup problems. We are required to yield CPU to others when we are waiting for somebody else to make progress. It is the page allocator who is refusing to yield CPU to those who need CPU. Since the OOM reaper kernel thread "has normal priority" and "can run on any CPU", the possibility of failing to run is lower than an OOM victim thread which "has idle priority" and "can run on only limited CPU". You are trying to add a dependency on such thread, and I'm saying that adding a dependency on such thread increases possibility of lockup. Yes, even the OOM reaper kernel thread might fail to run if all CPUs were busy with realtime threads waiting for the OOM reaper kernel thread to make progress. In that case, we had better stop relying on asynchronous memory reclaim, and switch to direct OOM reaping by allocating threads. But what I demonstrated is that /* * the exit path is guaranteed to finish the memory tear down * without any unbound blocking at this stage so make it clear * to the oom_reaper */ becomes a lie even when only one CPU was busy with realtime threads waiting for an idle thread to make progress. If the page allocator stops telling a lie that "an OOM victim is making progress on behalf of me", we can avoid the lockup. > This is a fundamental issue of the locking without priority > inheritance and other real time techniques. That is nothing but an evidence that you are refusing to solve the possibility of lockup, and will keep piling up stupid lockup bugs. OOM handling does not use locks when waiting for somebody else to make progress. Blaming "the locking without priority inheritance" is wrong. Each subsystem is responsible for avoiding the lockup. If one subsystem is triggering lockup due to printk() flooding, that subsystem avoids the lockup by stop abusing CPU resources by reducing the printk() messages. That's all we can do for now. MM is not privileged enough to lockup the system. > > > Nacked-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> > > OK, if your whole point is to block any changes into this area unless > it is your solution to be merged then I guess I will just push patches > through with your explicit Nack on them. Your feedback was far fetched > at many times has distracted the discussion way too often. This is > especially sad because your testing and review was really helpful at > times. I do not really have energy to argue the same set of arguments > over and over again. > > You have expressed unwillingness to understand the overall > picture several times. You do not care about a long term maintenance > burden of this code which is quite tricky already and refuse to > understand the cost/benefit part. > > If this series works for the workload reported by David I will simply > push it through and let Andrew decide. I'm skeptic about this approach. Given that exit_mmap() has to do more things than what the OOM reaper kernel thread can do, how likely the OOM reaper kernel thread can find that exit_mmap() has completed both fullset of unmap_vmas() and __unlink_vmas() within (at most) one second after the OOM reaper completed only subset of unmap_vmas() ? If exit_mmap() was preempted (due to scheduling priority), the OOM reaper might likely fail to find it. Anyway, if you insist this approach, I expect that exit_mmap() asks the OOM reaper kernel thread to call __free_pgtables(), and the OOM reaper kernel thread sets MMF_OOM_SKIP, and exit_mmap() resumes remove_vma() etc. , for the OOM reaper kernel thread ("has normal priority" and "can run on any CPU") is more reliable than a random thread ("has idle priority" and "can run on only limited CPU"). Given that the OOM reaper likely can find it, there are lack of reviews from each arch maintainers (regarding whether this approach is really safe, and whether this approach constrains future improvements). > If there is a lack of feedback > I will just keep it around because it seems that most users do not care > about these corner cases anyway. No way. We are always failing to get attention regarding OOM handling. ;-) Even how to avoid flooding by dump_header(oc, NULL); pr_warn("Out of memory and no killable processes...\n"); for memcg OOM is failing to make progress. People make changes without thinking about OOM handling. If you push this approach, solve the lack of reviews.