On Fri, 2023-12-15 at 12:18 -0800, dai.ngo@xxxxxxxxxx wrote:
> On 12/15/23 11:54 AM, Jeff Layton wrote:
> > On Fri, 2023-12-15 at 11:15 -0800, Dai Ngo wrote:
> > > If the callback workqueue is stuck, nfsd4_deleg_getattr_conflict will
> > > also get stuck waiting for the callback request to be executed. This causes
> > > the client to hang waiting for the reply of the GETATTR and also causes
> > > the reboot of the NFS server to hang due to the pending NFS request.
> > > 
> > > Fix by replacing wait_on_bit with wait_on_bit_timeout, using a 20-second
> > > timeout.
> > > 
> > > Reported-by: Wolfgang Walter <linux-nfs@xxxxxxx>
> > > Fixes: 6c41d9a9bd02 ("NFSD: handle GETATTR conflict with write delegation")
> > > Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > ---
> > >  fs/nfsd/nfs4state.c | 6 +++++-
> > >  fs/nfsd/state.h     | 2 ++
> > >  2 files changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > index 175f3e9f5822..0cc7d4953807 100644
> > > --- a/fs/nfsd/nfs4state.c
> > > +++ b/fs/nfsd/nfs4state.c
> > > @@ -2948,6 +2948,9 @@ void nfs4_cb_getattr(struct nfs4_cb_fattr *ncf)
> > >  	if (test_and_set_bit(CB_GETATTR_BUSY, &ncf->ncf_cb_flags))
> > >  		return;
> > >  
> > > +	/* set to proper status when nfsd4_cb_getattr_done runs */
> > > +	ncf->ncf_cb_status = NFS4ERR_IO;
> > > +
> > >  	refcount_inc(&dp->dl_stid.sc_count);
> > >  	if (!nfsd4_run_cb(&ncf->ncf_getattr)) {
> > >  		refcount_dec(&dp->dl_stid.sc_count);
> > > @@ -8558,7 +8561,8 @@ nfsd4_deleg_getattr_conflict(struct svc_rqst *rqstp, struct inode *inode,
> > >  			nfs4_cb_getattr(&dp->dl_cb_fattr);
> > >  			spin_unlock(&ctx->flc_lock);
> > >  
> > > -			wait_on_bit(&ncf->ncf_cb_flags, CB_GETATTR_BUSY, TASK_INTERRUPTIBLE);
> > > +			wait_on_bit_timeout(&ncf->ncf_cb_flags, CB_GETATTR_BUSY,
> > > +				TASK_INTERRUPTIBLE, NFSD_CB_GETATTR_TIMEOUT);
> > 
> > The RPC won't necessarily have timed out at this point, and it looks
> > like ncf_cb_status won't have been set to anything (and is probably
> > still 0?).
> 
> The timeout was added to handle the case where the callback request
> did not get queued to the workqueue, i.e. nfsd4_run_cb fails. In this case
> RPC is not involved and we don't want to hang here. Note that this patch
> sets ncf_cb_status to NFS4ERR_IO before calling nfsd4_run_cb so we can
> detect this error condition.
> 

Ok, I missed that bit, thanks.

> > 
> > Don't you need to check whether the wait timed out or was successful?
> 
> ncf_cb_status is set to tk_status by nfsd4_cb_getattr_done. If the request
> was successful then ncf_cb_status is 0.
> 
> > What happens now when this times out?
> 
> Then we go through the normal logic of nfsd_open_break_lease, which will
> also time out, but eventually the lease (the delegation state) will be
> removed by __break_lease after 45 secs (lease_break_time).
> 
> -Dai
> 
> > 
> > >  			if (ncf->ncf_cb_status) {
> > >  				status = nfserrno(nfsd_open_break_lease(inode, NFSD_MAY_READ));
> > >  				if (status != nfserr_jukebox ||
> > > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> > > index f96eaa8e9413..94563a6813a6 100644
> > > --- a/fs/nfsd/state.h
> > > +++ b/fs/nfsd/state.h
> > > @@ -135,6 +135,8 @@ struct nfs4_cb_fattr {
> > >  /* bits for ncf_cb_flags */
> > >  #define CB_GETATTR_BUSY		0
> > >  
> > > +#define NFSD_CB_GETATTR_TIMEOUT	msecs_to_jiffies(20000) /* 20 secs */
> > > +
> > 
> > Why 20s?
> 
> RPC will time out after 9 secs if it does not receive a callback reply.
> This timeout value needs to be greater than 9 secs. I was just being
> generous here; we can reduce it to any value > 9 secs.
> 
> -Dai
> 
> > 
> > >  /*
> > >   * Represents a delegation stateid.  The nfs4_client holds references to these
> > >   * and they are put when it is being destroyed or when the delegation is
> > 

Given that:

Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
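
For anyone reading this thread later, here is a minimal userspace sketch of the pattern discussed above: preset the status to an error before the callback is queued, wait with a timeout, then look only at the status. It is not the kernel code. It assumes POSIX threads in place of wait_on_bit_timeout, and every name in it (cb_sketch, SKETCH_ERR_IO, the cb_sketch_* helpers) is hypothetical and made up for illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Placeholder error code for this sketch; NOT the real NFS4ERR_IO value. */
#define SKETCH_ERR_IO 5

/* Hypothetical stand-in for the callback state kept in struct nfs4_cb_fattr. */
struct cb_sketch {
	pthread_mutex_t lock;
	pthread_cond_t done;
	bool busy;	/* plays the role of the CB_GETATTR_BUSY bit */
	int status;	/* plays the role of ncf_cb_status */
};

void cb_sketch_init(struct cb_sketch *cb)
{
	pthread_mutex_init(&cb->lock, NULL);
	pthread_cond_init(&cb->done, NULL);
	cb->busy = false;
	cb->status = 0;
}

/*
 * Mark the callback busy and preset the status to an error, so that a
 * callback that never runs is still seen as a failure by the waiter.
 */
void cb_sketch_start(struct cb_sketch *cb)
{
	pthread_mutex_lock(&cb->lock);
	cb->busy = true;
	cb->status = SKETCH_ERR_IO;
	pthread_mutex_unlock(&cb->lock);
}

/* Completion path: record the real result, clear busy, wake any waiter. */
void cb_sketch_complete(struct cb_sketch *cb, int result)
{
	pthread_mutex_lock(&cb->lock);
	cb->status = result;
	cb->busy = false;
	pthread_cond_broadcast(&cb->done);
	pthread_mutex_unlock(&cb->lock);
}

/*
 * Wait until busy clears or timeout_ms elapses, then return the status.
 * The caller does not need to know whether the wait timed out: on a
 * timeout the preset error is still in place.
 */
int cb_sketch_wait(struct cb_sketch *cb, int timeout_ms)
{
	struct timespec deadline;
	int status;

	/* Absolute deadline on the realtime clock, as the default condvar expects. */
	timespec_get(&deadline, TIME_UTC);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&cb->lock);
	while (cb->busy) {
		if (pthread_cond_timedwait(&cb->done, &cb->lock, &deadline) == ETIMEDOUT)
			break;
	}
	status = cb->status;
	pthread_mutex_unlock(&cb->lock);
	return status;
}
```

In this sketch, calling cb_sketch_start() and then cb_sketch_wait() without any completion returns the preset error once the timeout expires, which corresponds to the "callback was never queued" case. If cb_sketch_complete(cb, 0) runs first, the wait returns 0 immediately.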