Michal Hocko wrote:
> This has been posted in various forms many times over past years. I
> still do not think this is a right approach of dealing with the problem.

I do not think the "GFP_NOFS can fail" patch is the right approach,
because that patch easily causes messages like the ones below.

  Buffer I/O error on dev sda1, logical block 34661831, lost async page write
  XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
  XFS: possible memory allocation deadlock in xfs_buf_allocate_memory (mode:0x250)
  XFS: possible memory allocation deadlock in kmem_zone_alloc (mode:0x8250)

Adding __GFP_NOFAIL will hide these messages, but the OOM stall remains
anyway. I believe that choosing more OOM victims is the only way to
solve OOM stalls.

> You can quickly deplete memory reserves this way without making further
> progress (I am afraid you can even trigger this from userspace without
> having big privileges) so even administrator will have no way to
> intervene.

I think that the use of ALLOC_NO_WATERMARKS via TIF_MEMDIE is the
underlying cause. ALLOC_NO_WATERMARKS via TIF_MEMDIE is intended to
terminate the OOM victim task as soon as possible, but it turned out
that it does not work if there is an invisible lock dependency.
Therefore, why not give up the "there should be only up to 1 TIF_MEMDIE
task" rule?

What this patch (and many others posted in various forms over the past
years) does is give up that rule. I think that we need to tolerate more
than one TIF_MEMDIE task and somehow manage things so that memory
reserves are not depleted.

My proposal, which favors all fatal_signal_pending() tasks evenly
( http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@xxxxxxxxxxxxxxxxxxx ),
suggests that the OOM victim task is unlikely to need all of the memory
reserves. In other words, the OOM victim task can likely make forward
progress if it is allowed to use some amount of memory reserves
(compared to normal tasks waiting for memory).
So, I think that getting rid of the "ALLOC_NO_WATERMARKS via TIF_MEMDIE"
rule and replacing test_thread_flag(TIF_MEMDIE) with
fatal_signal_pending(current) will handle many cases, provided that
fatal_signal_pending() tasks are allowed to access some amount of memory
reserves. And my proposal which chooses the next OOM victim upon timeout
will handle the remaining cases without depleting memory reserves.

If you still want to keep the "there should be only up to 1 TIF_MEMDIE
task" rule, what alternative do you have? (I do not like
panic_on_oom_timeout because it is a more data-lossy approach than
choosing the next OOM victim.)