On Mon 20-08-18 14:31:04, David Rientjes wrote:
> On Mon, 20 Aug 2018, Michal Hocko wrote:
> 
> > > The oom reaper will always be unable to free some memory, such as page
> > > tables.  If it can't grab mm->mmap_sem in a reasonable amount of time,
> > > it can also give up early.  The munlock() case is another example.  We
> > > experience unnecessary oom killing during free_pgtables() where the
> > > single-threaded exit_mmap() is freeing an enormous amount of page
> > > tables (usually a malloc implementation such as tcmalloc that does not
> > > free virtual memory) and other processes are faulting faster than we
> > > can free.  It's a combination of a multiprocessor system and a lot of
> > > virtual memory from the original victim.  This is the same case as
> > > being unable to munlock quickly enough in exit_mmap() to free the
> > > memory.
> > > 
> > > We must wait until free_pgtables() completes in exit_mmap() before
> > > killing additional processes in the large majority (99.96% of cases
> > > from my data) of instances where oom livelock does not occur.  In the
> > > remaining situations, livelock has been prevented by what the oom
> > > reaper has been able to free.  We can, of course, not do
> > > free_pgtables() from the oom reaper.  So my approach was to allow a
> > > reasonable amount of time for the victim to free a lot of memory
> > > before declaring that additional processes must be oom killed.  It
> > > would be functionally similar to having the oom reaper retry many,
> > > many more times than 10 and having a linked list of mm_structs to
> > > reap.  I don't care one way or another whether it's a timeout-based
> > > solution or many, many retries with a schedule_timeout() that yields
> > > the same time period in the end.
> > 
> > I would really keep the current retry logic, with an extension that
> > allows it to keep retrying or to hand over to exit_mmap() when we know
> > it is past the last moment of blocking.
> > 
> 
> Ok, so it appears you're suggesting a per-mm counter of oom reaper
> retries: once it reaches a certain threshold, either give up and set
> MMF_OOM_SKIP or declare that exit_mmap() is responsible for it.  That's
> fine, but obviously I'll be suggesting that the threshold is rather
> large.  So if I adjust my patch to use a retry counter rather than a
> timestamp, do you have any other reservations?

It absolutely has to be an internal thing, without any user API to set
it.  Also, I still haven't heard any specific argument for why the oom
reaper would need to do per-task attempts and loop over all victims on
the list.  Maybe you have some examples though.

I believe that we really need to think about the hand over between the
two paths first and only build a more elaborate retry logic on top of
that.
-- 
Michal Hocko
SUSE Labs
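
For illustration, a rough sketch (not from any posted patch) of the per-mm
retry counter with a hand over to exit_mmap() discussed above.  The
mm->oom_reap_attempts field, the OOM_REAP_HANDOVER_RETRIES threshold, and
the exact loop shape are invented here; it assumes a helper like the
existing oom_reap_task_mm() that returns true once the victim's address
space has been reaped, and it omits the task-reference cleanup the real
oom_reap_task() does:

	/*
	 * Sketch only: mm->oom_reap_attempts would be a new field in
	 * struct mm_struct, and the threshold is an arbitrary placeholder.
	 */
	#define OOM_REAP_HANDOVER_RETRIES	1000

	static void oom_reap_task(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* Keep retrying until the mm has been reaped or skipped. */
		while (!test_bit(MMF_OOM_SKIP, &mm->flags) &&
		       !oom_reap_task_mm(tsk, mm)) {
			/*
			 * Past the threshold, hand over to exit_mmap(): it
			 * becomes responsible for setting MMF_OOM_SKIP once
			 * free_pgtables() has finished, so the reaper must
			 * not declare the victim done here.
			 */
			if (++mm->oom_reap_attempts > OOM_REAP_HANDOVER_RETRIES)
				return;
			schedule_timeout_idle(HZ / 10);
		}

		/* Reaped (or already skipped): new victims may be selected. */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}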