Re: [PATCH 3/3] NFSD: Fix server reboot hang problem when callback workqueue is stuck

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Dec 15, 2023 at 12:40:07PM -0800, dai.ngo@xxxxxxxxxx wrote:
> 
> On 12/15/23 11:54 AM, Chuck Lever wrote:
> > On Fri, Dec 15, 2023 at 11:15:03AM -0800, Dai Ngo wrote:
> > > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will
> > > also stuck waiting for the callback request to be executed. This causes
> > > the client to hang waiting for the reply of the GETATTR and also causes
> > > the reboot of the NFS server to hang due to the pending NFS request.
> > > 
> > > Fix by replacing wait_on_bit with wait_on_bit_timeout with 20 seconds
> > > time out.
> > > 
> > > Reported-by: Wolfgang Walter <linux-nfs@xxxxxxx>
> > > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
> > > Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > ---
> > >   fs/nfsd/nfs4state.c | 6 +++++-
> > >   fs/nfsd/state.h     | 2 ++
> > >   2 files changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index 175f3e9f5822..0cc7d4953807 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
> > >   	if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
> > >   		return;
> > > +	/* set to proper status when nfsd4_cb_getattr_done runs */
> > > +	ncf->ncf_cb_status = NFS4ERR_IO;
> > > +
> > >   	refcount_inc(&dp->dl_stid.sc_count);
> > >   	if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
> > >   		refcount_dec(&dp->dl_stid.sc_count);
> > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
> > >   			nfs4_cb_getattr(&dp->dl_cb_fattr);
> > >   			spin_unlock(&ctx->flc_lock);
> > > -			wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
> > > +			wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
> > > +				TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
> > I'm still thinking the timeout here should be the same (or slightly
> > longer than) the RPC retransmit timeout, rather than adding a new
> > NFSD_CB_GETATTR_TIMEOUT macro.
> 
> The NFSD_CB_GETATTR_TIMEOUT is used only when we can not submit a
> work item to the workqueue so RPC is not involved here.

In the "RPC was sent successfully" case, there is an implicit
assumption here that wait_on_bit_timeout() won't time out before the
actual RPC CB_GETATTR timeout.

You've chosen timeout values that happen to work, but there's
nothing in this patch that ties the two timeout values together or
in any other way documents this implicit assumption.


> We need to
> time out here to prevent the client (that causes the conflict) to
> hang waiting for the reply of the GETATTR and to prevent the server
> reboot to hang due to a pending NFS request.

Perhaps a better approach would be to not rely on a timeout, but
instead have nfs4_cb_getattr() wake up the bit wait before
returning, when it can't queue the work. That way, wait_on_bit()
will return immediately in that case.


> > >   			if (ncf->ncf_cb_status) {
> > >   				status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
> > >   				if (status != nfserr_jukebox ||
> > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> > > index f96eaa8e9413..94563a6813a6 100644
> > > --- a/fs/nfsd/state.h
> > > +++ b/fs/nfsd/state.h
> > > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr {
> > >   /* bits for ncf_cb_flags */
> > >   #define	CB_GETATTR_BUSY		0
> > > +#define	NFSD_CB_GETATTR_TIMEOUT	msecs_to_jiffies(20000) /* 20 secs */
> > > +
> > >   /*
> > >    * Represents a delegation stateid. The nfs4_client holds references to these
> > >    * and they are put when it is being destroyed or when the delegation is
> > > -- 
> > > 2.39.3
> > > 

-- 
Chuck Lever




[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux