On Wed, Feb 18, 2015 at 01:16:02PM +0100, Michal Hocko wrote:
> On Wed 18-02-15 21:48:59, Dave Chinner wrote:
> > On Wed, Feb 18, 2015 at 09:25:02AM +0100, Michal Hocko wrote:
> > > On Wed 18-02-15 09:54:30, Dave Chinner wrote:
> [...]
> > Also, this reads as an excuse for the OOM killer being broken and not fixing it. Keep in mind that we tell the memory alloc/reclaim subsystem that *we hold locks* when we call into it. That's what GFP_NOFS originally meant, and it's what it still means today in an XFS context.
>
> Sure, and OOM killer will not be invoked in NOFS context. See __alloc_pages_may_oom and __GFP_FS check in there. So I do not see where is the OOM killer broken.

I suspect that the page cache missing the correct GFP_NOFS was one of the sources of the problems I've been seeing. However, the oom killer exceptions are not checked if __GFP_NOFAIL is present, and so if we start using __GFP_NOFAIL then it will be called in GFP_NOFS contexts...

> The crucial problem we are dealing with is not GFP_NOFAIL triggering the OOM killer but a lock dependency introduced by the following sequence:
>
> taskA                          taskB        taskC
> lock(A)                                     alloc()
> alloc(gfp | __GFP_NOFAIL)      lock(A)      out_of_memory
>   # looping for ever if we                    select_bad_process
>   # cannot make any progress                    victim = taskB
>
> There is no way OOM killer can tell taskB is blocked and that there is dependency between A and B (without lockdep). That is why I call NOFAIL under a lock as dangerous and a bug.

Sure. However, eventually the OOM killer will select taskA to be killed because nothing else is working. That, at least, marks taskA with TIF_MEMDIE and gives us a potential way to break the deadlock. But the bigger problem is this:

	taskA                          taskB
	lock(A)
	alloc(GFP_NOFS|GFP_NOFAIL)
	                               lock(A)
	  out_of_memory
	    select_bad_process
	      victim = taskB

There is no way to *ever* resolve that dependency, because taskA never leaves the allocator. Even if the oom killer selects taskA and sets TIF_MEMDIE on it, the allocator ignores TIF_MEMDIE because GFP_NOFAIL is set and continues to loop. This is why GFP_NOFAIL is not a solution to the "never fail" allocation problem.

The caller doing the "no fail" allocation _must be able to set failure policy_. i.e. the choice of aborting and shutting down because progress cannot be made, or continuing and hoping for forwards progress, is owned by the allocating context, not the allocator. The memory allocation subsystem cannot make that choice for us as it has no concept of the failure characteristics of the allocating context.

The situations in which this actually matters are extremely *rare* - we've had these allocation loops in XFS for > 13 years, and we might get one or two reports a year of these "possible allocation deadlock" messages occurring. Changing *everything* for such a rare, unusual event is not an efficient use of time or resources.

> > If the OOM killer is not obeying GFP_NOFS and deadlocking on locks that the invoking context holds, then that is a OOM killer bug, not a bug in the subsystem calling kmalloc(GFP_NOFS).
>
> I guess we are talking about different things here or what am I missing?

From my perspective, you are tightly focussed on one aspect of the problem and hence are not seeing the bigger picture: this is a corner case of behaviour in a "last hope", brute force memory reclaim technique that no production machine relies on for correct or performant operation.
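To make the earlier point about those long-standing allocation loops concrete, they have roughly the following shape. This is a simplified sketch in the spirit of XFS's kmem_alloc() wrapper, not the actual code; the function name and backoff interval here are illustrative only:

#include <linux/slab.h>
#include <linux/printk.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>

/*
 * Sketch of a caller-owned "never fail" allocation loop.  The failure
 * policy - keep retrying, warn occasionally, back off so reclaim can
 * make progress - lives in the caller, not in the page allocator.
 */
static void *example_alloc_retry(size_t size, gfp_t gfp)
{
	int retries = 0;
	void *ptr;

	do {
		ptr = kmalloc(size, gfp);
		if (ptr)
			return ptr;
		/*
		 * Warn periodically - the "possible allocation deadlock"
		 * reports mentioned above come from a warning like this.
		 */
		if (!(++retries % 100))
			pr_err("possible memory allocation deadlock (size %zu, mode 0x%x)\n",
			       size, gfp);
		/* Give reclaim and writeback a chance to make progress. */
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	} while (1);
}

The point is that the caller can just as easily decide to give up after N retries and shut the filesystem down; the allocator cannot make that call for it.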
> [...]
> > > In the meantime page allocator should develop a proper diagnostic to help identify all the potential dependencies. Next we should start thinking whether all the existing GFP_NOFAIL paths are really necessary or the code can be refactored/reimplemented to accept allocation failures.
> >
> > Last time the "just make filesystems handle memory allocation failures" I pointed out what that meant for XFS: dirty transaction rollback is required. That's freakin' complex, will double the memory footprint of transactions, roughly double the CPU cost, and greatly increase the complexity of the transaction subsystem. It's a *major* rework of a significant amount of the XFS codebase and will take at least a couple of years design, test and stabilise before it could be rolled out to production.
> >
> > I'm not about to spend a couple of years rewriting XFS just so the VM can get rid of a GFP_NOFAIL user. Especially as the we already tell the Hammer of Last Resort the context in which it can work.
> >
> > Move the OOM killer to kswapd - get it out of the direct reclaim path altogether.
>
> This doesn't change anything as explained in other email. The triggering path doesn't wait for the victim to die.

But it does - we wouldn't be talking about deadlocks if there were no blocking dependencies. In this case, allocation keeps retrying until the memory freed by the killed tasks enables it to make forward progress. That's a side effect of the last revelation made in this thread - that low order allocations never fail...

> > If the system is that backed up on locks that it cannot free any memory and has no reserves to satisfy the allocation that kicked the OOM killer, then the OOM killer was not invoked soon enough.
> >
> > Hell, if you want a better way to proceed, then how about you allow us to tell the MM subsystem how much memory reserve a specific set of operations is going to require to complete? That's something that we can do rough calculations for, and it integrates straight into the existing transaction reservation system we already use for log space and disk space, and we can tell the mm subsystem when the reserve is no longer needed (i.e. last thing in transaction commit).
> >
> > That way we don't start a transaction until the mm subsystem has reserved enough pages for us to work with, and the reserve only needs to be used when normal allocation has already failed. i.e rather than looping we get a page allocated from the reserve pool.
>
> I am not sure I understand the above but isn't the mempools a tool for this purpose?

I knew this question would be the next one - I even deleted a one line comment from my last email that said "And no, mempools are not a solution" because that needs a more thorough explanation than a dismissive one-liner.

As you know, mempools require a forward progress guarantee on a single type of object and the objects must be slab based. In transaction context we allocate from inode slabs, xfs_buf slabs, log item slabs (6 different ones, IIRC), btree cursor slabs, etc, but then we also have direct page allocations for buffers, vm_map_ram() for mapping multi-page buffers, uncounted heap allocations, etc. We cannot make all of these mempools, nor can we meet the forwards progress requirements of a mempool because other allocations can block and prevent progress.
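To illustrate the single-object-type constraint: a slab-backed mempool is built over exactly one kmem_cache via mempool_create_slab_pool(), so every slab listed above would need its own pool, and the page allocator, heap and vm_map_ram() allocations are not covered at all. A minimal sketch - the cache name and object size below are made up for illustration:

#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical slab cache and pool, for illustration only. */
static struct kmem_cache *example_item_cache;
static mempool_t *example_item_pool;

static int example_pool_init(void)
{
	/* One cache for one object type... */
	example_item_cache = kmem_cache_create("example_item", 128, 0, 0, NULL);
	if (!example_item_cache)
		return -ENOMEM;

	/* ...and one pool of pre-allocated objects from that cache only. */
	example_item_pool = mempool_create_slab_pool(16, example_item_cache);
	if (!example_item_pool) {
		kmem_cache_destroy(example_item_cache);
		return -ENOMEM;
	}
	return 0;
}

mempool_alloc() from such a pool can guarantee forward progress only for "example_item" objects; it says nothing about the other slabs, the page allocator or the vm_map_ram() mappings a transaction also needs.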
Further, the objects have lifetimes that don't correspond to the transaction life cycles, and hence even if we complete the transaction there is no guarantee that the objects allocated within a transaction are going to be returned to the mempool at its completion.

IOWs, we need forward allocation progress guarantees on (potentially) several megabytes of allocations from slab caches, the heap and the page allocator, with all allocations in unpredictable order, with objects of different life times and life cycles, and which may, at any time, get stuck behind objects locked in other transactions and hence can randomly block until some other thread makes forward progress and completes a transaction and unlocks the object.

The reservation would only need to cover the memory we need to allocate and hold in the transaction (i.e. dirtied objects). There is a potentially unbounded amount of memory required through demand paging of buffers to find the metadata we need to modify, but demand paged metadata that is read and then released is recoverable. i.e. the shrinkers will free it as other memory demand requires, so it's not included in reservation pools because it doesn't deplete the amount of free memory.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx