On Tue, May 31, 2022 at 11:00 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
> > From: Olga Kornievskaia <kolga@xxxxxxxxxx>
> >
> > In recent pNFS testing we've encountered an IO thread starvation
> > problem during the time when the server returns the LAYOUTUNAVAILABLE
> > error to the client. When that happens, each IO request tries to get
> > a new layout, and the pnfs_update_layout() code ensures that only 1
> > LAYOUTGET RPC is outstanding; the rest wait. As the thread that gets
> > the layout wakes up the waiters, only one gets to run, and it tends
> > to be the latest one added to the waiting queue. After receiving the
> > LAYOUTUNAVAILABLE error, the client falls back to MDS writes, and as
> > those writes complete and new writes are issued, those requests are
> > added as waiters and get to run first, so the earliest waiter that
> > was put on the queue originally never gets to run until the
> > LAYOUTUNAVAILABLE condition resolves itself on the server.
> >
> > With the current code, if N IOs arrive asking for a layout, then
> > there will be N serial LAYOUTGETs that follow, each getting its own
> > LAYOUTUNAVAILABLE error. Instead, the patch proposes to apply the
> > error condition to ALL the waiters for the outstanding LAYOUTGET.
> > Once the error is received, the code lets all existing N IOs fall
> > back to the MDS, but any newly arriving IOs are queued up and one of
> > them triggers a new LAYOUTGET.
> >
> > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
> > ---
> >  fs/nfs/pnfs.c | 14 +++++++++++++-
> >  fs/nfs/pnfs.h |  2 ++
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > index 68a87be3e6f9..5b7a679e32c8 100644
> > --- a/fs/nfs/pnfs.c
> > +++ b/fs/nfs/pnfs.c
> > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
> >  	if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
> >  	    atomic_read(&lo->plh_outstanding) != 0) {
> >  		spin_unlock(&ino->i_lock);
> > +		atomic_inc(&lo->plh_waiting);
> >  		lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
> >  					!atomic_read(&lo->plh_outstanding)));
> > -		if (IS_ERR(lseg))
> > +		if (IS_ERR(lseg)) {
> > +			atomic_dec(&lo->plh_waiting);
> >  			goto out_put_layout_hdr;
> > +		}
> > +		if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
> > +			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> > +			lseg = NULL;
> > +			if (atomic_dec_and_test(&lo->plh_waiting))
> > +				clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> > +			goto out_put_layout_hdr;
> > +		}
> >  		pnfs_put_layout_hdr(lo);
> >  		goto lookup_again;
> >  	}
> > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
> >  	case -ERECALLCONFLICT:
> >  	case -EAGAIN:
> >  		break;
> > +	case -ENODATA:
> > +		set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> >  	default:
> >  		if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> >  			pnfs_layout_clear_fail_bit(lo, pnfs_iomode_to_fail_bit(iomode));
> > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > index 07f11489e4e9..5c07da32320b 100644
> > --- a/fs/nfs/pnfs.h
> > +++ b/fs/nfs/pnfs.h
> > @@ -105,6 +105,7 @@ enum {
> >  	NFS_LAYOUT_FIRST_LAYOUTGET,	/* Serialize first layoutget */
> >  	NFS_LAYOUT_INODE_FREEING,	/* The inode is being freed */
> >  	NFS_LAYOUT_HASHED,		/* The layout visible */
> > +	NFS_LAYOUT_DRAIN,
> >  };
> >
> >  enum layoutdriver_policy_flags {
> > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
> >
> >  struct pnfs_layout_hdr {
> >  	refcount_t		plh_refcount;
> >  	atomic_t		plh_outstanding; /* number of RPCs out */
> > +	atomic_t		plh_waiting;
> >  	struct list_head	plh_layouts;	/* other client layouts */
> >  	struct list_head	plh_bulk_destroy;
> >  	struct list_head	plh_segs;	/* layout segments list */
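To illustrate the waiter-draining idea for anyone not following the
kernel code: below is a small standalone userspace sketch of the same
pattern, using a pthread condition variable in place of the kernel's
wait_var_event(). All names are made up and this is only a model of the
pattern the patch applies, not kernel code. One thread performs the
expensive request while the others wait; if the request fails, the
failure is applied to every current waiter at once instead of letting
each waiter retry serially.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static bool outstanding;   /* a request is in flight */
static bool drain;         /* last request failed; waiters must bail */
static int  waiting;       /* threads parked in get_layout() */

/* Stand-in for LAYOUTGET; always "unavailable" in this sketch. */
static bool send_layoutget(void)
{
	return false;
}

/* Returns true if the caller got a layout; false => fall back to MDS. */
static bool get_layout(void)
{
	bool ok;

	pthread_mutex_lock(&lock);
	while (outstanding) {
		waiting++;
		pthread_cond_wait(&done, &lock);
		waiting--;
		if (drain) {
			/*
			 * Apply the in-flight request's failure to this
			 * waiter too; the last waiter out clears the
			 * flag so that a later arrival may try again.
			 */
			if (waiting == 0)
				drain = false;
			pthread_mutex_unlock(&lock);
			return false;
		}
	}
	outstanding = true;
	pthread_mutex_unlock(&lock);

	ok = send_layoutget();

	pthread_mutex_lock(&lock);
	outstanding = false;
	if (!ok && waiting > 0)
		drain = true;	/* fail every current waiter at once */
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
	return ok;
}

static void *io_thread(void *arg)
{
	printf("thread %ld: %s\n", (long)arg,
	       get_layout() ? "pNFS I/O" : "fall back to MDS");
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, io_thread, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}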
>
> According to the spec, the correct behaviour for handling
> NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
> and to fall back to doing I/O through the MDS. The error describes a
> more or less permanent state of the server being unable to hand out a
> layout for this file.
> If the server wanted the clients to retry after a delay, it should be
> returning NFS4ERR_LAYOUTTRYLATER. We already handle that correctly.

To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off pNFS
to the inode not permanently, but only for a period of time? It looks
to me like for LAYOUTTRYLATER the client would face the same starvation
problem in the same situation; I don't see anything marking the segment
failed for that error. I believe the client returns no layout for that
error and falls back to the MDS, but it allows asking for the layout
again after a period of time, which again builds up the queue of
waiters that is manipulated in a way that favors the last one added.

> Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE is
> just plain wrong: we just return no layout, and then let the next
> caller to pnfs_update_layout() immediately try again.
>
> My problem with this patch is that it just falls back to doing I/O
> through the MDS for the writes that are already queued in
> pnfs_update_layout(). It perpetuates the current bad behaviour of
> unnecessarily pounding the server with LAYOUTGET requests that are
> going to fail with the exact same error.
>
> I'd therefore prefer to see us fix the real bug (i.e. the handling of
> NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating issues
> with the queuing. I already have 2 patches to deal with this.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
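As an aside, a guess at the shape of the fix described above -- a
hypothetical sketch only, not the two patches Trond says he already
has: in pnfs_update_layout()'s error switch, NFS4ERR_LAYOUTUNAVAILABLE
arrives as -ENODATA, and setting the per-iomode fail bit there
(assuming a set-side counterpart to the pnfs_layout_clear_fail_bit()
seen in the patch) would make subsequent I/O skip LAYOUTGET and go
through the MDS until the fail bit expires, instead of retrying
immediately:

	case -ENODATA:
		/*
		 * Sketch: the server cannot hand out a layout for this
		 * file, so stop asking and fall back to the MDS until
		 * the fail bit times out.
		 */
		pnfs_layout_set_fail_bit(lo,
				pnfs_iomode_to_fail_bit(iomode));
		lseg = NULL;
		goto out_put_layout_hdr;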