Re: sporadic hangs on generic/186

On Fri, Apr 08, 2022 at 03:32:38PM +1000, NeilBrown wrote:
> On Fri, 08 Apr 2022, Dave Chinner wrote:
> > On Fri, Apr 08, 2022 at 12:46:08PM +1000, NeilBrown wrote:
> > > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > > The bottom line is that we use ordinary GFP_KERNEL memory allocations
> > > > where we can. The new code follows that rule, breaking it only in cases
> > > > where the specific rules of rpciod/xprtiod/nfsiod make it impossible to
> > > > wait forever in the memory manager.
> > > 
> > > It is not safe to use GFP_KERNEL for an allocation that is needed in
> > > order to free memory - and so any allocation that is needed to write out
> > > data from the page cache.
> > 
> > Except that same page cache writeback path can be called from
> > syscall context (e.g.  fsync()) which has nothing to do with memory
> > reclaim. In that case GFP_KERNEL is the correct allocation context
> > to use because there are no constraints on what memory reclaim can
> > be performed from this path.
> > 
> > IOWs, if the context initiating data writeback doesn't allow
> > GFP_KERNEL allocations, then it should be calling
> > memalloc_nofs_save() or memalloc_noio_save() to constrain all
> > allocations to the required context. We should not be requiring the
> > filesystem (or any other subsystem) to magically infer that the IO
> > is being done in a constrained allocation context and modify the
> > context they use appropriately.
> > 
> > If we did this, then all filesystems would simply use GFP_NOIO
> > everywhere because the loop device layers the entire filesystem IO
> > path under block device context (i.e. requiring GFP_NOIO allocation
> > context). We don't do this - the loop device sets PF_MEMALLOC_NOIO
> > instead so all allocations in that path run with at least GFP_NOIO
> > constraints and filesystems are none the wiser about the constraints
> > of the calling context.
> > 
> > IOWs, GFP_KERNEL is generally the right context to be using in
> > filesystem IO paths and callers need to restrict allocation contexts
> > via task flags if they cannot allow certain types of reclaim
> > recursion to occur...
> 
> NOIO and NOFS are not the issue here.  We all agree that
> memalloc_noXX_save() is the right thing to do.
> 
> The issue is that memalloc can block indefinitely in low-memory
> situations, and any code that has to make progress in low-memory
> situations - like writeout - needs to be careful.

Yup, and you've missed my point entirely, then explained exactly why
high level memory allocation context needs to be set by the code
doing mempool-based allocations...
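
i.e. the pattern is roughly this at the top of the IO submission
path (a minimal sketch - memalloc_nofs_save()/restore() are the real
scoped allocation API, the surrounding function and do_writeback()
are made-up stand-ins):

	#include <linux/sched/mm.h>

	/* hypothetical writeback initiation from a constrained context */
	static void constrained_start_writeback(struct inode *inode)
	{
		unsigned int nofs_flags;

		/*
		 * Scope all nested allocations to GFP_NOFS so nothing
		 * below us can recurse into filesystem reclaim, without
		 * every callee having to know about the constraint.
		 */
		nofs_flags = memalloc_nofs_save();
		do_writeback(inode);	/* callees use plain GFP_KERNEL */
		memalloc_nofs_restore(nofs_flags);
	}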

> The bio is allocated from a mempool, and nothing below submit_bio() uses
> GFP_KERNEL to alloc_page() - they all use mempools (or accept failure in
> some other way).  A separate mempool at each level - they aren't shared
> (so they are quite different from __GFP_MEMALLOC).

.... because this house of cards using mempools only works if it
is nested mempools all the way down.
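
i.e. every layer needs its own private reserve for the guarantee to
compose (sketch only - mempool_create_kmalloc_pool() is the real
API, the two pools and their sizes are illustrative):

	#include <linux/mempool.h>

	static mempool_t *upper_pool;	/* hypothetical upper layer */
	static mempool_t *lower_pool;	/* hypothetical lower layer */

	static int pools_init(void)
	{
		upper_pool = mempool_create_kmalloc_pool(8, 256);
		lower_pool = mempool_create_kmalloc_pool(8, 4096);
		if (!upper_pool || !lower_pool) {
			mempool_destroy(upper_pool);
			mempool_destroy(lower_pool);
			return -ENOMEM;
		}
		return 0;
	}

	/*
	 * mempool_alloc(pool, GFP_NOIO) only ever sleeps waiting for a
	 * previously allocated element to come back, so the guarantee
	 * holds only if every nested layer allocates the same way.
	 */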

> The networking people refuse to use mempools (or at least, they did once

Same as many filesystem people refuse to use mempools, because the
page writeback IO path in a filesystem can have *unbound* memory
demand. mempools just don't work in filesystems that need to run
transactions or read metadata (e.g. for block allocation
transactions).  We've been through this many times before; it's why
filesystems like XFS and Btrfs do not allow ->writepage from memory
reclaim contexts.

> But in NFS and particularly in SUNRPC we already have the mempool, and
> there is only a small number of things that we need to allocate to
> ensure forward progress.  So using a mempool as designed, rather than
> relying on MEMALLOC reserves is the sensible thing to do, and leaves
> more of the reserves for the networking layer.

Except the mempool now requires everything in the path below it not
to block on memory allocation for it to guarantee forwards progress.
i.e. nested allocations need to either succeed or error out - both
guarantee *mempool* forwards progress (not necessarily forwards
progress cleaning memory), and this is only needed for the paths
where we actually need to provide a forwards progress guarantee.
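
e.g. the stock way to give a nested, non-mempool allocation
fail-instead-of-block semantics (snippet only; 'buf' and the error
path are illustrative):

	/* inside the forwards progress path: never block in reclaim */
	buf = kmalloc(len, GFP_NOWAIT | __GFP_NOWARN);
	if (!buf)
		return -ENOMEM;	/* unwind; the mempool still progresses */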

The RPC code puts the entire network stack under a mempool, and as
you say the network stack is resistant to using mempools.  At which
point, NFS needs memalloc_noreclaim_save() context to be set
explicitly in the path nested inside the mempool allocation so that
non-mempool allocations don't block trying to reclaim memory that
might never come available....

That's the point I was trying to make - GFP_KERNEL is not the
problem here - it's that mempools only work when it's mempools all
the way down. Individual code paths and/or allocations may have no
idea that they are running in a nested mempool allocation context,
and so all allocations in that path - GFP_KERNEL or otherwise - need
to be automatically converted to fail-instead-of-block semantics by
the high level code that uses a mempool-based allocation...
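
Roughly this shape (a sketch, not the actual sunrpc code -
memalloc_noreclaim_save() sets PF_MEMALLOC, so nested GFP_KERNEL
allocations skip direct reclaim and fail instead of blocking;
rpc_buf_pool and xmit_request() are made-up stand-ins):

	#include <linux/sched/mm.h>
	#include <linux/mempool.h>

	static int send_with_guarantee(void *req)
	{
		unsigned int flags;
		void *buf;
		int ret;

		buf = mempool_alloc(rpc_buf_pool, GFP_NOIO);

		/*
		 * Everything below runs inside the mempool's forwards
		 * progress guarantee: nested allocations (e.g. in the
		 * network stack) must not block in direct reclaim
		 * waiting for memory this path is supposed to clean.
		 */
		flags = memalloc_noreclaim_save();
		ret = xmit_request(req, buf);	/* hypothetical */
		memalloc_noreclaim_restore(flags);

		mempool_free(buf, rpc_buf_pool);
		return ret;
	}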

> In fact, the allocation that SUNRPC now does before trying a mempool
> should really be using __GFP_NOMEMALLOC so that they don't take from the
> shared reserves (even when PF_MEMALLOC is set).  As it has a private
> reserve (the mempool) it should leave the common reserve for other
> layers (sockets etc).

Yup, but that's a high level, pre-mempool allocation optimisation
which is not useful to future allocations done inside the mempool
forwards progress guarantee context....
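
For the record, the pre-mempool attempt would then look something
like this (sketch; 'pool' and 'size' are illustrative):

	/* opportunistic first try: fail fast, and never dip into the
	 * PF_MEMALLOC reserves shared with the network stack */
	p = kmalloc(size, GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!p)
		p = mempool_alloc(pool, GFP_NOIO); /* private reserve */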

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


