On Fri, 08 Apr 2022, Dave Chinner wrote:
> On Fri, Apr 08, 2022 at 12:46:08PM +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > The bottom line is that we use ordinary GFP_KERNEL memory allocations
> > > where we can. The new code follows that rule, breaking it only in cases
> > > where the specific rules of rpciod/xprtiod/nfsiod make it impossible to
> > > wait forever in the memory manager.
> >
> > It is not safe to use GFP_KERNEL for an allocation that is needed in
> > order to free memory - and so any allocation that is needed to write out
> > data from the page cache.
>
> Except that same page cache writeback path can be called from
> syscall context (e.g. fsync()) which has nothing to do with memory
> reclaim. In that case GFP_KERNEL is the correct allocation context
> to use because there are no constraints on what memory reclaim can
> be performed from this path.
>
> IOWs, if the context initiating data writeback doesn't allow
> GFP_KERNEL allocations, then it should be calling
> memalloc_nofs_save() or memalloc_noio_save() to constrain all
> allocations to the required context. We should not be requiring the
> filesystem (or any other subsystem) to magically infer that the IO
> is being done in a constrained allocation context and modify the
> context they use appropriately.
>
> If we did this, then all filesystems would simply use GFP_NOIO
> everywhere because the loop device layers the entire filesystem IO
> path under block device context (i.e. requiring GFP_NOIO allocation
> context). We don't do this - the loop device sets PF_MEMALLOC_NOIO
> instead so all allocations in that path run with at least GFP_NOIO
> constraints and filesystems are none the wiser about the constraints
> of the calling context.
>
> IOWs, GFP_KERNEL is generally the right context to be using in
> filesystem IO paths and callers need to restrict allocation contexts
> via task flags if they cannot allow certain types of reclaim
> recursion to occur...

NOIO and NOFS are not the issue here.  We all agree that
memalloc_noXX_save() is the right thing to do.  The issue is that
memalloc can block indefinitely in low-memory situations, and any code
that has to make progress in low-memory situations - like writeout -
needs to be careful.

This is why the block layer uses mempools for request headers etc - so
that progress is guaranteed without depending on alloc_page() to
succeed.

File systems do often get away with using GFP_KERNEL because the
important paths have PF_MEMALLOC, and hence __GFP_MEMALLOC, in effect,
and that provides access to some shared reserves.  Shared reserves are
risky - the other users you are sharing with might steal it all.  File
systems tend to survive anyway because there is a limit on the amount
of dirty filesystem data - so there is lots of non-filesystem data
around, and a good chance that some of that can be freed.  I say "tend
to" because I believe there is no real guarantee.  It seems to
actually work 99.999% of the time, and maybe that is enough.  I
suspect you might be able to deadlock filesystem writeout by
memlocking lots of memory while there are lots of dirty pages.  It
probably wouldn't be easy though.

swap-out is different.  There is no limit to the amount of dirty anon
data, so it is fairly easy to get into a situation where you
absolutely must write out some anon pages before alloc_page() can
succeed.  Most filesystems don't handle swap-out directly - they just
tell the MM which disk addresses to use, and submit_bio() is used for
writing.  The bio is allocated from a mempool, and nothing below
submit_bio() uses GFP_KERNEL to alloc_page() - they all use mempools
(or accept failure in some other way).  There is a separate mempool at
each level - they aren't shared (so they are quite different from
__GFP_MEMALLOC).
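
To make that concrete, the pattern looks roughly like this (invented
names, not actual block-layer code): a small private reserve is set up
once, and mempool_alloc() is then guaranteed to make forward progress -
at worst it waits for one of our own objects to come back via
mempool_free(), never for alloc_page() to succeed on behalf of
something else.

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical per-request header - just a sketch. */
struct my_hdr {
        void            *payload;
        unsigned int    flags;
};

#define MY_HDR_POOL_MIN 4       /* size of the private reserve */

static struct kmem_cache *my_hdr_cache;
static mempool_t *my_hdr_pool;

static int __init my_hdr_pool_init(void)
{
        my_hdr_cache = KMEM_CACHE(my_hdr, 0);
        if (!my_hdr_cache)
                return -ENOMEM;
        /* Pre-allocates MY_HDR_POOL_MIN objects that only we can use. */
        my_hdr_pool = mempool_create_slab_pool(MY_HDR_POOL_MIN, my_hdr_cache);
        if (!my_hdr_pool) {
                kmem_cache_destroy(my_hdr_cache);
                return -ENOMEM;
        }
        return 0;
}

static struct my_hdr *my_hdr_alloc(void)
{
        /*
         * With a sleeping gfp mask this never returns NULL: if the slab
         * allocation fails it takes an element from the private reserve,
         * and if the reserve is empty it sleeps until my_hdr_free() runs.
         */
        return mempool_alloc(my_hdr_pool, GFP_NOIO);
}

static void my_hdr_free(struct my_hdr *hdr)
{
        mempool_free(hdr, my_hdr_pool);
}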
NFS is different.  NFS handles swap using the same paths as other
writes, so it is much more likely to face indefinite waits in
alloc_page() - at least when handling swap.  __GFP_MEMALLOC helps to a
degree, but there are lots of levels, and the more of those levels
that have local reserves (even if the mempool only reserves a single
element), the better.

The networking people refuse to use mempools (or at least, they did
once some years ago) and I cannot entirely blame them, as there are
lots of moving parts - lots of different things that might need to be
allocated (arp table entries?) but usually aren't.  So for everything
in the socket layer and below we rely on __GFP_MEMALLOC (and recommend
increasing the reserves a bit above the default, just in case).

But in NFS, and particularly in SUNRPC, we already have the mempool,
and there is only a small number of things that we need to allocate to
ensure forward progress.  So using a mempool as designed, rather than
relying on MEMALLOC reserves, is the sensible thing to do, and it
leaves more of the reserves for the networking layer.

In fact, the allocations that SUNRPC now does before trying a mempool
should really be using __GFP_NOMEMALLOC so that they don't take from
the shared reserves (even when PF_MEMALLOC is set).  As SUNRPC has a
private reserve (the mempool), it should leave the common reserve for
other layers (sockets etc).
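
What I mean is roughly the following pattern (just a sketch with
made-up names, not the actual rpc_malloc() code, and I've picked
GFP_NOWAIT for the first attempt purely for illustration): the
opportunistic allocation opts out of the emergency reserves with
__GFP_NOMEMALLOC, and forward progress comes from our own mempool
instead.

#include <linux/mempool.h>
#include <linux/slab.h>

/*
 * Sketch only - invented name, not the real SUNRPC code.  Assumes
 * 'size' fits within the pool's element size, and that the matching
 * free path knows which of the two allocators the buffer came from
 * (omitted here).
 */
static void *rpc_buffer_alloc_sketch(mempool_t *pool, size_t size)
{
        void *buf;

        /*
         * Opportunistic attempt: __GFP_NOMEMALLOC means a PF_MEMALLOC
         * caller does not dip into the shared emergency reserves.
         */
        buf = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC);
        if (buf)
                return buf;

        /* May sleep, but only until one of our own buffers is freed. */
        return mempool_alloc(pool, GFP_NOIO);
}

NeilBrown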