On Tue, 2022-05-31 at 12:03 -0400, Olga Kornievskaia wrote:
> On Tue, May 31, 2022 at 11:00 AM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 2022-05-31 at 09:48 -0400, Olga Kornievskaia wrote:
> > > From: Olga Kornievskaia <kolga@xxxxxxxxxx>
> > >
> > > In recent pnfs testing we've encountered an IO thread starvation
> > > problem during periods when the server returns the
> > > LAYOUTUNAVAILABLE error to the client. When that happens, each IO
> > > request tries to get a new layout, and the pnfs_update_layout()
> > > code ensures that only 1 LAYOUTGET RPC is outstanding; the rest
> > > wait. When the thread that gets the layout wakes up the waiters,
> > > only one gets to run, and it tends to be the latest one added to
> > > the waiting queue. After receiving the LAYOUTUNAVAILABLE error the
> > > client falls back to MDS writes, and as those writes complete and
> > > new writes are issued, those requests are added as waiters and get
> > > to run first, so the earliest waiter, the one put on the queue
> > > originally, never gets to run until the LAYOUTUNAVAILABLE
> > > condition resolves itself on the server.
> > >
> > > With the current code, if N IOs arrive asking for a layout, then
> > > there will be N serial LAYOUTGETs that follow, each getting its
> > > own LAYOUTUNAVAILABLE error. Instead, this patch proposes to apply
> > > the error condition to ALL the waiters for the outstanding
> > > LAYOUTGET. Once the error is received, the code allows all N
> > > existing IOs to fall back to the MDS, but any newly arriving IOs
> > > are queued up, and one of them will trigger a new LAYOUTGET.
> > >
> > > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
> > > ---
> > >  fs/nfs/pnfs.c | 14 +++++++++++++-
> > >  fs/nfs/pnfs.h |  2 ++
> > >  2 files changed, 15 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> > > index 68a87be3e6f9..5b7a679e32c8 100644
> > > --- a/fs/nfs/pnfs.c
> > > +++ b/fs/nfs/pnfs.c
> > > @@ -2028,10 +2028,20 @@ pnfs_update_layout(struct inode *ino,
> > >         if ((list_empty(&lo->plh_segs) || !pnfs_layout_is_valid(lo)) &&
> > >             atomic_read(&lo->plh_outstanding) != 0) {
> > >                 spin_unlock(&ino->i_lock);
> > > +               atomic_inc(&lo->plh_waiting);
> > >                 lseg = ERR_PTR(wait_var_event_killable(&lo->plh_outstanding,
> > >                                         !atomic_read(&lo->plh_outstanding)));
> > > -               if (IS_ERR(lseg))
> > > +               if (IS_ERR(lseg)) {
> > > +                       atomic_dec(&lo->plh_waiting);
> > >                         goto out_put_layout_hdr;
> > > +               }
> > > +               if (test_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags)) {
> > > +                       pnfs_layout_clear_fail_bit(lo,
> > > +                                       pnfs_iomode_to_fail_bit(iomode));

By the way: this call to pnfs_layout_clear_fail_bit() would break any
future fix to NFS4ERR_LAYOUTUNAVAILABLE. It's not the right thing to do
here.
> > > +                       lseg = NULL;
> > > +                       if (atomic_dec_and_test(&lo->plh_waiting))
> > > +                               clear_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> > > +                       goto out_put_layout_hdr;
> > > +               }
> > >                 pnfs_put_layout_hdr(lo);
> > >                 goto lookup_again;
> > >         }
> > > @@ -2152,6 +2162,8 @@ pnfs_update_layout(struct inode *ino,
> > >                 case -ERECALLCONFLICT:
> > >                 case -EAGAIN:
> > >                         break;
> > > +               case -ENODATA:
> > > +                       set_bit(NFS_LAYOUT_DRAIN, &lo->plh_flags);
> > >                 default:
> > >                         if (!nfs_error_is_fatal(PTR_ERR(lseg))) {
> > >                                 pnfs_layout_clear_fail_bit(lo,
> > >                                                 pnfs_iomode_to_fail_bit(iomode));
> > > diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> > > index 07f11489e4e9..5c07da32320b 100644
> > > --- a/fs/nfs/pnfs.h
> > > +++ b/fs/nfs/pnfs.h
> > > @@ -105,6 +105,7 @@ enum {
> > >         NFS_LAYOUT_FIRST_LAYOUTGET,     /* Serialize first layoutget */
> > >         NFS_LAYOUT_INODE_FREEING,       /* The inode is being freed */
> > >         NFS_LAYOUT_HASHED,              /* The layout visible */
> > > +       NFS_LAYOUT_DRAIN,
> > >  };
> > >
> > >  enum layoutdriver_policy_flags {
> > > @@ -196,6 +197,7 @@ struct pnfs_commit_ops {
> > >  struct pnfs_layout_hdr {
> > >         refcount_t              plh_refcount;
> > >         atomic_t                plh_outstanding; /* number of RPCs out */
> > > +       atomic_t                plh_waiting;
> > >         struct list_head        plh_layouts;    /* other client layouts */
> > >         struct list_head        plh_bulk_destroy;
> > >         struct list_head        plh_segs;       /* layout segments list */
> >
> > According to the spec, the correct behaviour for handling
> > NFS4ERR_LAYOUTUNAVAILABLE is to stop trying to do pNFS to the inode,
> > and to fall back to doing I/O through the MDS. The error describes a
> > more or less permanent state of the server being unable to hand out
> > a layout for this file.
> > If the server wanted the clients to retry after a delay, it should
> > be returning NFS4ERR_LAYOUTTRYLATER. We already handle that
> > correctly.
>
> To clarify, can you confirm that LAYOUTUNAVAILABLE would turn off
> pNFS for the inode not permanently, but only for a period of time?

See pnfs_layout_io_test_failed(). It automatically clears the fail bit
after a period of PNFS_LAYOUTGET_RETRY_TIMEOUT (i.e. 120 seconds).

> It looks to me that for LAYOUTTRYLATER, the client would face the
> same starvation problem in the same situation. I don't see anything
> marking the segment as failed for such an error? I believe the client
> returns no layout for that error and falls back to the MDS, but
> allows asking for the layout for a period of time, so we again have
> the queue of waiters that gets manipulated in such a way that it
> favors the last one added.
>
> > Currently, what our client does to handle NFS4ERR_LAYOUTUNAVAILABLE
> > is just plain wrong: we just return no layout, and then let the
> > next caller to pnfs_update_layout() immediately try again.
> >
> > My problem with this patch is that it just falls back to doing I/O
> > through the MDS for the writes that are already queued in
> > pnfs_update_layout(). It perpetuates the current bad behaviour of
> > unnecessary pounding of the server with LAYOUTGET requests that are
> > going to fail with the exact same error.
> >
> > I'd therefore prefer to see us fix the real bug (i.e. the handling
> > of NFS4ERR_LAYOUTUNAVAILABLE) first, and then look at mitigating
> > issues with the queuing.

I already have 2 patches to deal with this.
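For readers following the discussion: the queuing behaviour the patch is
aiming for can be modelled roughly as below. This is a toy,
single-threaded userspace sketch, not kernel code; the names
"outstanding", "waiting" and "drain" are illustrative stand-ins for
plh_outstanding, plh_waiting and NFS_LAYOUT_DRAIN.

/*
 * Toy model: only one LAYOUTGET is outstanding; when it fails with
 * LAYOUTUNAVAILABLE, every I/O that was already waiting is "drained"
 * to the MDS instead of each issuing its own LAYOUTGET, and only a
 * later, newly arriving I/O sends the next one.
 */
#include <stdbool.h>
#include <stdio.h>

static int  outstanding;  /* LAYOUTGET RPCs in flight (0 or 1 here)  */
static int  waiting;      /* I/Os queued behind the outstanding RPC  */
static bool drain;        /* error seen: send queued I/Os to the MDS */

static void io_arrives(int id)
{
	if (outstanding) {
		waiting++;
		printf("io %d: queued behind outstanding LAYOUTGET\n", id);
		return;
	}
	outstanding = 1;
	printf("io %d: sends LAYOUTGET\n", id);
}

static void layoutget_fails(void)
{
	outstanding = 0;
	if (waiting)
		drain = true;	/* apply the error to all current waiters */
	printf("server: LAYOUTUNAVAILABLE\n");
}

static void waiter_wakes(int id)
{
	if (drain) {
		printf("io %d: drained, falls back to the MDS\n", id);
		if (--waiting == 0)
			drain = false;	/* last waiter re-arms LAYOUTGET */
	}
}

int main(void)
{
	io_arrives(1);		/* sends LAYOUTGET           */
	io_arrives(2);		/* queued                    */
	io_arrives(3);		/* queued                    */
	layoutget_fails();	/* sets the drain condition  */
	waiter_wakes(2);	/* goes to the MDS           */
	waiter_wakes(3);	/* goes to the MDS, clears the drain */
	io_arrives(4);		/* new I/O: fresh LAYOUTGET  */
	return 0;
}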
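Similarly, the fail-bit timeout behaviour referred to above can be
sketched as follows. Again this is a simplified, runnable userspace
model rather than the kernel implementation; the struct, field names
and timeout constant are stand-ins for pnfs_layout_hdr,
plh_retry_timestamp and PNFS_LAYOUTGET_RETRY_TIMEOUT.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define RETRY_TIMEOUT 120	/* seconds, matching the 120s mentioned above */

struct layout_hdr {
	bool   fail_bit;	/* models the per-iomode layoutget fail bit */
	time_t retry_timestamp;	/* when the failure was recorded            */
};

/* Called when LAYOUTGET fails: mark the layout failed and stamp the time. */
static void layout_set_failed(struct layout_hdr *lo)
{
	lo->fail_bit = true;
	lo->retry_timestamp = time(NULL);
}

/*
 * Returns true while the fail bit should still suppress LAYOUTGET.
 * Once the retry timeout has elapsed, clear the bit so the next I/O
 * is allowed to try pNFS again.
 */
static bool layout_io_test_failed(struct layout_hdr *lo)
{
	if (!lo->fail_bit)
		return false;
	if (time(NULL) - lo->retry_timestamp >= RETRY_TIMEOUT) {
		lo->fail_bit = false;	/* time to retry LAYOUTGET */
		return false;
	}
	return true;	/* still failed: do I/O through the MDS */
}

int main(void)
{
	struct layout_hdr lo = { 0 };

	layout_set_failed(&lo);
	printf("just after failure: use MDS? %s\n",
	       layout_io_test_failed(&lo) ? "yes" : "no");

	/* Pretend two minutes have passed. */
	lo.retry_timestamp -= RETRY_TIMEOUT;
	printf("after the retry timeout: use MDS? %s\n",
	       layout_io_test_failed(&lo) ? "yes" : "no");
	return 0;
}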
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> >
> >

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx