On Wed, 18 Apr 2018, Tetsuo Handa wrote:

> > Commit 97b1255cb27c is referencing MMF_OOM_SKIP already being set by
> > exit_mmap(). The only thing this patch changes is where that is done:
> > before or after free_pgtables(). We can certainly move it to before
> > free_pgtables() at the risk of subsequent (and eventually unnecessary) oom
> > kills. It's not exactly the point of this patch.
> >
> > I have thousands of real-world examples where additional processes were
> > oom killed while the original victim was in free_pgtables(). That's why
> > we've moved the MMF_OOM_SKIP to after free_pgtables().
>
> "we have moved"? No, not yet. Your patch is about to move it.

I'm referring to our own kernel: we have thousands of real-world examples
in which additional processes have been oom killed while the original
victim is in free_pgtables(). It happens about 10-15% of the time in
automated testing where you create a 128MB memcg, fork a canary, and then
fork a >128MB memory hog. In those runs both processes get oom killed: the
memory hog first (higher rss), the canary second. The pgtable stat is
unchanged between oom kills.

> My question is: is it guaranteed that munlock_vma_pages_all()/unmap_vmas()/free_pgtables()
> by exit_mmap() are never blocked for memory allocation. Note that exit_mmap() tries to unmap
> all pages while the OOM reaper tries to unmap only safe pages. If there is possibility that
> munlock_vma_pages_all()/unmap_vmas()/free_pgtables() by exit_mmap() are blocked for memory
> allocation, your patch will introduce an OOM livelock.

If munlock_vma_pages_all(), unmap_vmas(), or free_pgtables() require memory
to make forward progress, then we have bigger problems :)

I just ran a query of the real-world oom kill logs that I have. In
33,773,705 oom kills, I have no evidence of a thread failing to exit after
reaching exit_mmap().

You may recall, from my support of your patch to emit the stack trace when
the oom reaper fails (https://marc.info/?l=linux-mm&m=152157881518627), that
I have logs of 28,222,058 occurrences of the oom reaper successfully freeing
memory and the victim exiting. If you'd like to pursue the possibility that
exit_mmap() blocks before freeing memory, in a way we have somehow been lucky
to miss in 33 million occurrences, I'd appreciate a test case.