On Mon, 6 Aug 2018, Michal Hocko wrote:

> On Sat 04-08-18 22:29:46, Tetsuo Handa wrote:
> > David Rientjes is complaining about current behavior that the OOM killer
> > selects next OOM victim as soon as MMF_OOM_SKIP is set even if
> > __oom_reap_task_mm() returned without any progress.
> >
> > To address this problem, this patch adds a timeout with whether the OOM
> > score of an OOM victim's memory is decreasing over time as a feedback,
> > after MMF_OOM_SKIP is set by the OOM reaper or exit_mmap().
>
> I still hate any feedback mechanism based on time. We have seen that
> these paths are completely non-deterministic time wise that building
> any heuristic on top of it just sounds wrong.
>
> Yes we have problems that the oom reaper doesn't handle all types of
> memory yet. We should cover most of reasonably large memory types by
> now. There is still mlock to take care of and that would be much
> preferable to work on regardless the retry mechanism because this work
> will simply not handle that case either.
>
> So I do not really see this would be an improvement. I still stand by my
> argument that any retry mechanism should be based on the direct feedback
> from the oom reaper rather than some magic "this took that long without
> any progress".
>

At the risk of continually repeating the same statement, the oom reaper
cannot provide the direct feedback for all possible memory freeing.
Waking up periodically and finding mm->mmap_sem contended is one problem,
but the other problem that I've already shown is the unnecessary oom
killing of additional processes while a thread has already reached
exit_mmap().

The oom reaper cannot free page tables which is problematic for malloc
implementations such as tcmalloc that do not release virtual memory.
For binaries with heaps that are very large, sometimes over 100GB, this
is a substantial amount of memory and we have seen unnecessary oom killing
before and during free_pgtables() of the victim. This is long after the
oom reaper would operate on any mm.
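
To give a rough sense of the scale involved, here is a minimal userspace
sketch that estimates the page table footprint of a large heap. It is
only an illustration and rests on assumptions not stated above: x86-64
style paging with 4 KiB pages and 512 eight-byte entries per table, a
contiguous and fully populated mapping, and no huge pages. The names
PT_PAGE_SIZE, PT_ENTRIES_PER_TABLE and page_table_bytes() are
hypothetical and are not kernel symbols.

```c
#include <stdint.h>

/* Assumed values (x86-64 with 4 KiB pages); not taken from the thread. */
#define PT_PAGE_SIZE          4096ULL
#define PT_ENTRIES_PER_TABLE  512ULL

/* Ceiling division for unsigned 64-bit values. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Hypothetical helper: estimate the bytes of page tables needed to map
 * heap_bytes of contiguous, fully populated memory. It counts the three
 * lowest levels (PTE, PMD, PUD tables) and ignores the top-level table
 * and any alignment effects.
 */
static uint64_t page_table_bytes(uint64_t heap_bytes)
{
	/* Start with the number of mapped 4 KiB pages. */
	uint64_t tables = div_round_up(heap_bytes, PT_PAGE_SIZE);
	uint64_t total = 0;
	int level;

	for (level = 0; level < 3; level++) {
		/* Each table page holds 512 entries for the level below. */
		tables = div_round_up(tables, PT_ENTRIES_PER_TABLE);
		total += tables * PT_PAGE_SIZE;
	}
	return total;
}
```

Under those assumptions, page_table_bytes(100ULL << 30) comes to roughly
200 MiB for a 100 GiB heap, almost all of it in the lowest-level tables.
That memory is only returned by free_pgtables(), not by the oom reaper.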