On Tue, 2022-05-31 at 12:03 -0400, Olga Kornievskaia wrote:
> On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
> > > From: Olga Kornievskaia <kolga@xxxxxxxxxx>
> > >
> > > In recent pnfs testing we've encountered an IO thread starvation
> > > problem during periods when the server returns the
> > > LAYOUTUNAVAILABLE error to the client. When that happens, each IO
> > > request tries to get a new layout, and the pnfs_update_layout()
> > > code ensures that only 1 LAYOUTGET RPC is outstanding; the rest
> > > wait. When the thread that gets the layout wakes up the waiters,
> > > only one gets to run, and it tends to be the latest one added to
> > > the waiting queue. After receiving the LAYOUTUNAVAILABLE error the
> > > client falls back to MDS writes, and as those writes complete and
> > > new writes are issued, those requests are added as waiters and get
> > > to run first, so the earliest waiter, the one put on the queue
> > > originally, never gets to run until the LAYOUTUNAVAILABLE
> > > condition resolves itself on the server.
> > >
> > > With the current code, if N IOs arrive asking for a layout, then
> > > there will be N serial LAYOUTGETs that follow, each getting its
> > > own LAYOUTUNAVAILABLE error. Instead, this patch proposes to apply
> > > the error condition to ALL the waiters for the outstanding
> > > LAYOUTGET. Once the error is received, the code allows all N
> > > existing IOs to fall back to the MDS, but any newly arriving IOs
> > > are queued up, and one of them will trigger a new LAYOUTGET.
> > >
> > > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
> > > ---
> > >  fs/nfs/pnfs.c | 14 +++++++++++++-
> > >  fs/nfs/pnfs.h |  2 ++
> > >  2 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > > index 68a87be3e6f9..5b7a679e32c8 100644
> > > --- a/fs/nfs/pnfs.c
> > > +++ b/fs/nfs/pnfs.c
> > > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
> > >         if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
> > >             atomic_read(&lo->plh_outstanding) != 0) {
> > >                 spin_unlock(&ino->i_lock);
> > > +               atomic_inc(&lo->plh_waiting);
> > >                 lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
> > >                                         !atomic_read(&lo->plh_outstanding)));
> > > -               if (IS_ERR(lseg))
> > > +               if (IS_ERR(lseg)) {
> > > +                       atomic_dec(&lo->plh_waiting);
> > >                         goto out_put_layout_hdr;
> > > +               }
> > > +               if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
> > > +                       pnfs_layout_clear_fail_bit(lo,
> > > +                                       pnfs_iomode_to_fail_bit(iomode));

By the way: this call to pnfs_layout_clear_fail_bit() would break any
future fix to NFS4ERR_LAYOUTUNAVAILABLE. It's not the right thing to do
here.
> > > +                       lseg = NULL;
> > > +                       if (atomic_dec_and_test(&lo->plh_waiting))
> > > +                               clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> > > +                       goto out_put_layout_hdr;
> > > +               }
> > >                 pnfs_put_layout_hdr(lo);
> > >                 goto lookup_again;
> > >         }
> > > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
> > >                 case -ERECALLCONFLICT:
> > >                 case -EAGAIN:
> > >                         break;
> > > +               case -ENODATA:
> > > +                       set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> > >                 default:
> > >                         if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> > >                                 pnfs_layout_clear_fail_bit(lo,
> > >                                                 pnfs_iomode_to_fail_bit(iomode));
> > > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > > index 07f11489e4e9..5c07da32320b 100644
> > > --- a/fs/nfs/pnfs.h
> > > +++ b/fs/nfs/pnfs.h
> > > @@ -105,6 +105,7 @@ enum {
> > >         NFS_LAYOUT_FIRST_LAYOUTGET,     /* Serialize first layoutget */
> > >         NFS_LAYOUT_INODE_FREEING,       /* The inode is being freed */
> > >         NFS_LAYOUT_HASHED,              /* The layout visible */
> > > +       NFS_LAYOUT_DRAIN,
> > >  };
> > >
> > >  enum layoutdriver_policy_flags {
> > > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
> > >  struct pnfs_layout_hdr {
> > >         refcount_t              plh_refcount;
> > >         atomic_t                plh_outstanding; /* number of RPCs out */
> > > +       atomic_t                plh_waiting;
> > >         struct list_head        plh_layouts;    /* other client layouts */
> > >         struct list_head        plh_bulk_destroy;
> > >         struct list_head        plh_segs;       /* layout segments list */
> >
> > According to the spec, the correct behaviour for handling
> > NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
> > and to fall back to doing I/O through the MDS. The error describes a
> > more or less permanent state of the server being unable to hand out
> > a layout for this file.
> > If the server wanted the clients to retry after a delay, it should
> > be returning NFS4ERR_LAYOUTTRYLATER. We already handle that
> > correctly.
>
> To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off
> pNFS for the inode not permanently, but only for a period of time?

See pnfs_layout_io_test_failed(). It automatically clears the fail bit
after a period of PNFS_LAYOUTGET_RETRY_TIMEOUT (i.e. 120 seconds).

> It looks to me that for LAYOUTTRYLATER, the client would face the
> same starvation problem in the same situation. I don't see anything
> marking the segment as failed for such an error? I believe the client
> returns no layout for that error and falls back to the MDS, but
> allows asking for the layout for a period of time, so we again have
> the queue of waiters that gets manipulated in such a way that it
> favors the last one added.
>
> > Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE
> > is just plain wrong: we just return no layout, and then let the
> > next caller to pnfs_update_layout() immediately try again.
> >
> > My problem with this patch is that it just falls back to doing I/O
> > through the MDS for the writes that are already queued in
> > pnfs_update_layout(). It perpetuates the current bad behaviour of
> > unnecessary pounding of the server with LAYOUTGET requests that are
> > going to fail with the exact same error.
> >
> > I'd therefore prefer to see us fix the real bug (i.e. the handling
> > of NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating
> > issues with the queuing.

I already have 2 patches to deal with this.
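For readers following the discussion: the queuing behaviour the patch is
aiming for can be modelled roughly as below. This is a toy,
single-threaded userspace sketch, not kernel code; the names
"outstanding", "waiting" and "drain" are illustrative stand-ins for
plh_outstanding, plh_waiting and NFS_LAYOUT_DRAIN.

/*
 * Toy model: only one LAYOUTGET is outstanding; when it fails with
 * LAYOUTUNAVAILABLE, every I/O that was already waiting is "drained"
 * to the MDS instead of each issuing its own LAYOUTGET, and only a
 * later, newly arriving I/O sends the next one.
 */
#include <stdbool.h>
#include <stdio.h>

static int  outstanding;  /* LAYOUTGET RPCs in flight (0 or 1 here)  */
static int  waiting;      /* I/Os queued behind the outstanding RPC  */
static bool drain;        /* error seen: send queued I/Os to the MDS */

static void io_arrives(int id)
{
	if (outstanding) {
		waiting++;
		printf("io %d: queued behind outstanding LAYOUTGET\n", id);
		return;
	}
	outstanding = 1;
	printf("io %d: sends LAYOUTGET\n", id);
}

static void layoutget_fails(void)
{
	outstanding = 0;
	if (waiting)
		drain = true;	/* apply the error to all current waiters */
	printf("server: LAYOUTUNAVAILABLE\n");
}

static void waiter_wakes(int id)
{
	if (drain) {
		printf("io %d: drained, falls back to the MDS\n", id);
		if (--waiting == 0)
			drain = false;	/* last waiter re-arms LAYOUTGET */
	}
}

int main(void)
{
	io_arrives(1);		/* sends LAYOUTGET           */
	io_arrives(2);		/* queued                    */
	io_arrives(3);		/* queued                    */
	layoutget_fails();	/* sets the drain condition  */
	waiter_wakes(2);	/* goes to the MDS           */
	waiter_wakes(3);	/* goes to the MDS, clears the drain */
	io_arrives(4);		/* new I/O: fresh LAYOUTGET  */
	return 0;
}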
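Similarly, the fail-bit timeout behaviour referred to above can be
sketched as follows. Again this is a simplified, runnable userspace
model rather than the kernel implementation; the struct, field names
and timeout constant are stand-ins for pnfs_layout_hdr,
plh_retry_timestamp and PNFS_LAYOUTGET_RETRY_TIMEOUT.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define RETRY_TIMEOUT 120	/* seconds, matching the 120s mentioned above */

struct layout_hdr {
	bool   fail_bit;	/* models the per-iomode layoutget fail bit */
	time_t retry_timestamp;	/* when the failure was recorded            */
};

/* Called when LAYOUTGET fails: mark the layout failed and stamp the time. */
static void layout_set_failed(struct layout_hdr *lo)
{
	lo->fail_bit = true;
	lo->retry_timestamp = time(NULL);
}

/*
 * Returns true while the fail bit should still suppress LAYOUTGET.
 * Once the retry timeout has elapsed, clear the bit so the next I/O
 * is allowed to try pNFS again.
 */
static bool layout_io_test_failed(struct layout_hdr *lo)
{
	if (!lo->fail_bit)
		return false;
	if (time(NULL) - lo->retry_timestamp >= RETRY_TIMEOUT) {
		lo->fail_bit = false;	/* time to retry LAYOUTGET */
		return false;
	}
	return true;	/* still failed: do I/O through the MDS */
}

int main(void)
{
	struct layout_hdr lo = { 0 };

	layout_set_failed(&lo);
	printf("just after failure: use MDS? %s\n",
	       layout_io_test_failed(&lo) ? "yes" : "no");

	/* Pretend two minutes have passed. */
	lo.retry_timestamp -= RETRY_TIMEOUT;
	printf("after the retry timeout: use MDS? %s\n",
	       layout_io_test_failed(&lo) ? "yes" : "no");
	return 0;
}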
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> >
> >

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx