On Fri, 25 Sep 2015, Michal Hocko wrote:

> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very much useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic. So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> >
> > Which locks are you specifically interested in?
>
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.

I agree that i_mutex seems to be one of the most common offenders. However,
I'm not sure I understand why holding it while trying to allocate
infinitely for an order-0 allocation is problematic wrt the proposed
kthread. The kthread itself need only take mmap_sem for read. If all
threads sharing the mm with a victim have been SIGKILL'd, they should get
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can
drop mmap_sem. We must ensure that any holder of mmap_sem cannot quickly
deplete memory reserves without properly checking for
fatal_signal_pending().

> > We have already discussed
> > the usefulness of killing all threads on the system sharing the same ->mm,
> > meaning all threads that are either holding or want to hold mm->mmap_sem
> > will be able to allocate into memory reserves. Any allocator holding
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.
> > (Are you concerned about MAP_POPULATE?)
>
> I am not sure I understand. We would have to fail the request in order
> the context which requested the memory could drop the lock. Are we
> talking about the same thing here?

Not fail the request: they should be able to allocate from memory reserves
when TIF_MEMDIE gets set. This would require that threads in all gfp
contexts are able to get TIF_MEMDIE set without an explicit call to
out_of_memory() for !__GFP_FS.

> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem,
> > it's the reason the code exists. Any optimizations to that is certainly
> > welcome, but we definitely need to send SIGKILL to all threads sharing the
> > mm to make forward progress, otherwise we are going back to pre-2008
> > livelocks.
>
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessarily time.

It must solve the issue you describe, killing other processes that share
the ->mm, otherwise we have mm->mmap_sem livelock. Iterating over all
task_structs in the oom killer is not a pain point we are concerned about;
users for whom it is should already be using oom_kill_allocating_task,
which is why it was introduced.
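
To make the mmap_sem scenario above concrete, here is a minimal
illustrative sketch (not code from the tree; example_populate() is a
hypothetical caller) of a GFP_KERNEL allocation done with mmap_sem held
for write. The idea is that a SIGKILL'd holder either backs out via
fatal_signal_pending() or, once TIF_MEMDIE is set, allocates from memory
reserves, so the write lock gets dropped either way:

#include <linux/errno.h>
#include <linux/mm_types.h>
#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/slab.h>

/*
 * Hypothetical example, not from the kernel tree: allocate while holding
 * mmap_sem for write, which is the situation pointed at above.  A task
 * that has been SIGKILL'd (and will get TIF_MEMDIE when reclaim fails)
 * should either allocate from memory reserves or back out quickly so
 * that up_write() runs and an OOM victim sharing this mm can make
 * progress.
 */
static int example_populate(struct mm_struct *mm, size_t size)
{
	void *buf;
	int ret = 0;

	down_write(&mm->mmap_sem);

	/*
	 * Back out instead of looping in the allocator and depleting
	 * memory reserves once we have been killed; dropping mmap_sem
	 * is more useful than the allocation at this point.
	 */
	if (fatal_signal_pending(current)) {
		ret = -EINTR;
		goto out_unlock;
	}

	/*
	 * An order-0 GFP_KERNEL request does not fail under the current
	 * "too small to fail" semantic; with TIF_MEMDIE set the
	 * allocator may dip into reserves, so this returns rather than
	 * looping forever with the write lock held.
	 */
	buf = kmalloc(size, GFP_KERNEL);
	if (!buf) {
		ret = -ENOMEM;
		goto out_unlock;
	}

	/* ... use buf under mmap_sem ... */

	kfree(buf);
out_unlock:
	up_write(&mm->mmap_sem);
	return ret;
}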
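
And a rough sketch of the task_struct walk discussed in the last
paragraph (illustrative only, not the exact code in mm/oom_kill.c;
example_kill_mm_sharers() is a made-up name): every process sharing the
victim's ->mm gets SIGKILL so that whichever of them holds or waits for
mmap_sem can exit and release it, avoiding the pre-2008 style livelock:

#include <linux/mm_types.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/signal.h>

/*
 * Illustrative sketch of killing all CLONE_VM-without-CLONE_THREAD
 * sharers of the victim's mm; the real oom killer does a walk of this
 * shape for each invocation.
 */
static void example_kill_mm_sharers(struct task_struct *victim,
				    struct mm_struct *mm)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		if (p->mm != mm)	/* racy without task_lock(); fine for a sketch */
			continue;
		if (same_thread_group(p, victim))
			continue;	/* victim's own threads already got SIGKILL */
		if (p->flags & PF_KTHREAD)
			continue;	/* never kill kernel threads */
		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
	}
	rcu_read_unlock();
}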