On Thu, 2022-04-07 at 14:11 +1000, NeilBrown wrote:
> On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > On Thu, 2022-04-07 at 11:19 +1000, NeilBrown wrote:
> > > On Thu, 07 Apr 2022, NeilBrown wrote:
> > > > On Thu, 07 Apr 2022, Dave Chinner wrote:
> > > > > On Wed, Apr 06, 2022 at 03:54:24PM -0400, J. Bruce Fields wrote:
> > > > > > In the last couple of days I've started getting hangs on
> > > > > > xfstests generic/186 on upstream. I also notice the test
> > > > > > completes after 10+ hours (usually it takes about 5
> > > > > > minutes). Sometimes this is accompanied by "nfs: RPC call
> > > > > > returned error 12" on the client.
> > > > >
> > > > > #define ENOMEM          12      /* Out of memory */
> > > > >
> > > > > So either the client or the server is running out of memory
> > > > > somewhere?
> > > >
> > > > Probably the client. There are a bunch of recent changes that
> > > > add __GFP_NORETRY to memory allocations from PF_WQ_WORKERs,
> > > > because blocking allocations there can result in deadlocks
> > > > when swapping over NFS.
> > > > This means that kmalloc requests that previously never failed
> > > > (because GFP_KERNEL never fails for kernel threads, I think)
> > > > can now fail. This has tickled one bug that I know of. There
> > > > are likely to be more.
> > > >
> > > > The RPC code should simply retry these allocations after a
> > > > short delay. HZ/4 is the number that is used in a couple of
> > > > places. Possibly there are more places that need to handle
> > > > -ENOMEM with rpc_delay().
> > >
> > > I had a look through the various places where alloc can now fail.
> > >
> > > I think xdr_alloc_bvec() in xprt_send_pagedata() is the most
> > > likely cause of a problem here. I don't think an -ENOMEM from
> > > there is caught, so it could likely filter up to NFS and result
> > > in the message you got.
> > >
> > > I don't think we can easily handle failure there. We need to
> > > stay with GFP_KERNEL and rely on PF_MEMALLOC to make forward
> > > progress for swap-over-NFS.
> > >
> > > Bruce: can you change that one line back to GFP_KERNEL and see
> > > if the problem goes away?
> > >
> >
> > Can we please just move the call to xdr_alloc_bvec() out of
> > xprt_send_pagedata()? Move the client-side allocation into
> > xs_stream_prepare_request() and xs_udp_send_request(), then move
> > the server-side allocation into svc_udp_sendto().
> >
> > That makes it possible to handle errors.
>
> Like the below, I guess. Seems sensible, but I don't know the code
> well enough to really review it.
>
> > > The other problem I found is that rpc_alloc_task() can now fail,
> > > but rpc_new_task() assumes that it never will. If it does, then
> > > we get a NULL deref.
> > >
> > > I don't think rpc_new_task() can ever be called from the rpciod
> > > work queue, so it is safe to just use a mempool with GFP_KERNEL
> > > like we did before.
> > >
> > No. We shouldn't ever use mempools with GFP_KERNEL.
>
> Why not? mempools with GFP_KERNEL make perfect sense outside of the
> rpciod and nfsiod threads.

If you can afford to make it an infinite wait, there is __GFP_NOFAIL,
so why waste the resources of an emergency pool? In my opinion,
however, an infinite uninterruptible sleep is bad policy for almost
all cases, because someone will want to break out at some point.
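To make the trade-off concrete, here is a minimal sketch (not a
patch), assuming the rpc_task_slabp slab cache and rpc_task_mempool
emergency pool that already live in net/sunrpc/sched.c:

	/* Option 1: a mempool pre-reserves min_nr objects at creation
	 * time and dips into that reserve when the slab allocation
	 * fails, so the reserved memory is tied down even if it is
	 * never used:
	 */
	struct rpc_task *task = mempool_alloc(rpc_task_mempool, GFP_KERNEL);

	/* Option 2: __GFP_NOFAIL reserves nothing up front, but makes
	 * the allocator loop in an uninterruptible wait until the
	 * request can be satisfied; there is no way to break out:
	 */
	struct rpc_task *task2 = kmem_cache_alloc(rpc_task_slabp,
						  GFP_KERNEL | __GFP_NOFAIL);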
> >
> > Most, if not all, of the callers of rpc_run_task() are still
> > capable of dealing with errors, and ditto for the callers of
> > rpc_run_bc_task().
>
> Yes, they can deal with errors. But in many cases they do so by
> passing the error up the call stack, so we could start getting
> ENOMEM for system calls like stat(). I don't think that is a good
> idea.

stat() has always been capable of returning ENOMEM if, for instance,
inode allocation fails. There are almost no calls in NFS (or most
other filesystems, for that matter) that can't fail somehow when
memory starts to get really scarce.

The bottom line is that we use ordinary GFP_KERNEL memory allocations
where we can. The new code follows that rule, breaking it only in
cases where the specific rules of rpciod/xprtiod/nfsiod make it
impossible to wait forever in the memory manager.

I am preparing a set of patches to address the issues that you've
identified, plus a case in call_transmit_status()/
call_bc_transmit_status() where we're not handling ENOMEM. There are
also patches to fix up two cases in the NFS code itself where we're
not currently handling errors from rpc_run_task().

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
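P.S. The call_transmit_status() change is roughly of the following
shape. This is a sketch only, assuming -ENOMEM can share the
treatment of the existing -ENOBUFS case; the posted patches are
authoritative:

	switch (task->tk_status) {
	...
	case -ENOMEM:
	case -ENOBUFS:
		/* Transient resource shortage: back off for HZ>>2
		 * (about 250ms) and requeue the transmit rather than
		 * failing the RPC call outright.
		 */
		rpc_delay(task, HZ >> 2);
		task->tk_action = call_transmit;
		task->tk_status = 0;
		break;
	...
	}

call_bc_transmit_status() would grow a similar -ENOMEM leg.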