On Fri 29-04-16 15:35:42, NeilBrown wrote: > On Tue, Apr 26 2016, Michal Hocko wrote: > > > Hi, > > we have discussed this topic at LSF/MM this year. There was a general > > interest in the scope GFP_NOFS allocation context among some FS > > developers. For those who are not aware of the discussion or the issue > > I am trying to sort out (or at least start in that direction) please > > have a look at patch 1 which adds memalloc_nofs_{save,restore} api > > which basically copies what we have for the scope GFP_NOIO allocation > > context. I haven't converted any of the FS myself because that is way > > beyond my area of expertise but I would be happy to help with further > > changes on the MM front as well as in some more generic code paths. > > > > Dave had an idea on how to further improve the reclaim context to be > > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque > > and FS specific cookie set in the FS allocation context and consumed > > by the FS reclaim context to allow doing some provably save actions > > that would be skipped due to GFP_NOFS normally. I like this idea and > > I believe we can go that direction regardless of the approach taken here. > > Many filesystems simply need to cleanup their NOFS usage first before > > diving into a more complex changes.> > > This strikes me as over-engineering to work around an unnecessarily > burdensome interface.... but without details it is hard to be certain. > > Exactly what things happen in "FS reclaim context" which may, or may > not, be safe depending on the specific FS allocation context? Do they > need to happen at all? Let me quote Dave Chinner from one of the emails discussed at LSFMM mailing list: : IMO, making GFP_NOFS "better" cannot be done with context-less flags : being passed through reclaim. If we want to prevent the recursive : self-deadlock case in an optimal manner, then we need to be able to : pass state down to reclaim so that page writeback and the shrinkers : can determine if they are likely to deadlock. : : IOWs, I think we should stop thinking of GFP_NOFS as a *global* : directive to avoid recursion under any circumstance and instead : start thinking about it as a mechanism to avoid recursion in : specific reclaim contexts. : : Something as simple as adding an opaque cookie (e.g. can hold a : superblock or inode pointer) to check against in writeback and : subsystem shrinkers would result in the vast majority of GFP_NOFS : contexts being able to reclaim from everything but the one context : that we *might* deadlock against. : : e.g, if we then also check the PF_FSTRANS flag in XFS, we'll : still be able to reclaim clean inodes, buffers and write back : dirty pages that don't require transactions to complete under "don't : recurse" situations because we know it's transactions that we could : deadlock on in the direct reclaim context. : : Note that this information could be added to the writeback_control : for page writeback, and it could be passed directly to shrinkers : in the shrink_control structures. The allocation paths might be a : little harder, but I suspect using the task struct for passing this : information into direct reclaim might be the easiest approach... > My research suggests that for most filesystems the only thing that > happens in reclaim context that is at all troublesome is the final > 'evict()' on an inode. This needs to flush out dirty pages and sync the > inode to storage. Some time ago we moved most dirty-page writeout out > of the reclaim context and into kswapd. I think this was an excellent > advance in simplicity. > If we could similarly move evict() into kswapd (and I believe we can) > then most file systems would do nothing in reclaim context that > interferes with allocation context. > > The exceptions include: > - nfs and any filesystem using fscache can block for up to 1 second > in ->releasepage(). They used to block waiting for some IO, but that > caused deadlocks and wasn't really needed. I left the timeout because > it seemed likely that some throttling would help. I suspect that a > careful analysis will show that there is sufficient throttling > elsewhere. > > - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits > for IO so it can free some quotainfo things. If it could be changed > to just schedule the IO without waiting for it then I think this > would be safe to be called in any FS allocation context. It already > uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking > if the lock is held. > > I think you/we would end up with a much simpler system if instead of > focussing on the places where GFP_NOFS is used, we focus on places where > __GFP_FS is tested, and try to remove them. One think I have learned is that shrinkers can be really complex and getting rid of GFP_NOFS will be really hard so I would really like to start the easiest way possible and remove the direct usage and replace it by scope one which would at least _explain_ why it is needed. I think this is a reasonable _first_ step and a large step ahead because we have a good chance to get rid of a large number of those which were used "just because I wasn't sure and this should be safe, right?". I wouldn't be surprised if we end up with a very small number of both scope and direct usage in the end. I would also like to revisit generic inode/dentry shrinker and see whether it could be more __GFP_FS friendly. As you say many FS might even not depend on some FS internal locks so pushing GFP_FS check down the layers might make a lot of sense and allow to clean some [id]cache even for __GFP_FS context. > If we get rid of enough of them the remainder could just use __GFP_IO. > > > The patch 2 is a debugging aid which warns about explicit allocation > > requests from the scope context. This is should help to reduce the > > direct usage of the NOFS flags to bare minimum in favor of the scope > > API. It is not aimed to be merged upstream. I would hope Andrew took it > > into mmotm tree to give it linux-next exposure and allow developers to > > do further cleanups. There is a new kernel command line parameter which > > has to be used for the debugging to be enabled. > > > > I think the GFP_NOIO should be seeing the same clean up. > > I think you are suggesting that use of GFP_NOIO should (largely) be > deprecated in favour of memalloc_noio_save(). I think I agree. Yes that was the idea. > Could we go a step further and deprecate GFP_ATOMIC in favour of some > in_atomic() test? Maybe that is going too far. I am not really sure we need that and some GFP_NOWAIT usage is deliberate to perform an optimistic allocation with another fallback (e.g. higher order for performance reasons with single page fallback). So I think that nowait is a slightly different thing. Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html