On Mon 22-12-14 07:42:49, Dave Chinner wrote:
[...]
> "memory reclaim gave up"? So why the hell isn't it returning a
> failure to the caller?
>
> i.e. We have a perfectly good page cache allocation failure error
> path here all the way back to userspace, but we're invoking the
> OOM-killer to kill random processes rather than returning ENOMEM to
> the processes that are generating the memory demand?
>
> Further: when did the oom-killer become the primary method
> of handling situations when memory allocation needs to fail?
> __GFP_WAIT does *not* mean memory allocation can't fail - that's what
> __GFP_NOFAIL means. And none of the page cache allocations use
> __GFP_NOFAIL, so why aren't we getting an allocation failure before
> the oom-killer is kicked?

Well, it has been an unwritten rule that low-order
(<= PAGE_ALLOC_COSTLY_ORDER) GFP_KERNEL allocations never fail. This
decision was made long ago and would be tricky to change now without
silently breaking a lot of code. Sad...

Nevertheless, the caller can avoid the endless loop by using
__GFP_NORETRY, so this could be used as a workaround. IMO the default
should be the opposite: only those who really require some guarantee
should use a special flag for that purpose.

> > I guess __alloc_pages_direct_reclaim() returns NULL with did_some_progress > 0
> > so that __alloc_pages_may_oom() will not be called easily. As long as
> > try_to_free_pages() returns non-zero, __alloc_pages_direct_reclaim() might
> > return NULL with did_some_progress > 0. So, do_try_to_free_pages() is called
> > for many times and is likely to return non-zero. And when
> > __alloc_pages_may_oom() is called, TIF_MEMDIE is set on the thread waiting
> > for mutex_lock(&"struct inode"->i_mutex) at xfs_file_buffered_aio_write()
> > and I see no further progress.
>
> Of course - TIF_MEMDIE doesn't do anything to the task that is
> blocked, and the SIGKILL signal can't be delivered until the syscall
> completes or the kernel code checks for pending signals and handles
> EINTR directly. Mutexes are uninterruptible by design so there's no
> EINTR processing, hence the oom killer cannot make progress when
> everything is blocked on mutexes waiting for memory allocation to
> succeed or fail.
>
> i.e. until the lock holder exits from direct memory reclaim and
> releases the locks it holds, the oom killer will not be able to save
> the system. IOWs, the problem is that memory allocation is not
> failing when it should....
>
> Focussing on the OOM killer here is the wrong way to solve this
> problem - the problem that needs to be solved is sane handling of
> OOM conditions to avoid needing to invoke the OOM-killer...

Completely agreed!
[...]
-- 
Michal Hocko
SUSE Labs