Re: Spurious instability with NFSoRDMA under moderate load

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Thu, 12 Aug 2021 11:40:01 -0400

On Wed, Aug 11, 2021 at 04:40:04PM -0400, Olga Kornievskaia wrote:
> On Wed, Aug 11, 2021 at 4:14 PM J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> >
> > On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> > > Probably not just CB_RECALL, but agreed, there doesn't seem to
> > > be any mechanism that can re-drive callback operations when the
> > > backchannel is replaced.
> >
> > The nfsd4_queue_cb() in nfsd4_cb_release() should queue a work item
> > to run nfsd4_run_cb_work, which should set up another callback client if
> > necessary.

But I think the result is that it'll check whether there's another
connection available for callbacks, and give up immediately if not.

There's no logic to wait for the client to fix the problem.
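
To illustrate the gap, here's a toy user-space model. It is not
kernel code, and every toy_* name in it is made up for illustration.
It assumes that a requeue amounts to one immediate retry, which is
my reading of the current behaviour, and contrasts that with a
hypothetical bounded wait for the client to bring up a new
backchannel:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in fs/nfsd. */
struct toy_client {
	int conn_ready_at;	/* tick when the client rebinds a backchannel */
};

struct toy_cb {
	bool need_restart;	/* stands in for cb->cb_need_restart */
	bool done;		/* callback was delivered */
};

static bool toy_backchannel_available(const struct toy_client *clp, int now)
{
	return now >= clp->conn_ready_at;
}

/*
 * One pass of the toy callback work item at tick 'now'.
 * Returns true if the callback could be sent.
 */
static bool toy_run_cb_once(const struct toy_client *clp,
			    struct toy_cb *cb, int now)
{
	if (!toy_backchannel_available(clp, now))
		return false;	/* no usable connection: give up at once */
	cb->need_restart = false;
	cb->done = true;
	return true;
}

/* Requeue only: mark for restart and retry exactly once, right away. */
static bool toy_requeue_only(const struct toy_client *clp,
			     struct toy_cb *cb, int now)
{
	cb->need_restart = true;
	return toy_run_cb_once(clp, cb, now);
}

/*
 * Requeue plus a bounded wait: retry on every tick from 'now' through
 * 'now + max_ticks', stopping as soon as one attempt succeeds.
 */
static bool toy_requeue_and_wait(const struct toy_client *clp,
				 struct toy_cb *cb, int now, int max_ticks)
{
	assert(max_ticks >= 0);
	cb->need_restart = true;
	for (int t = now; t <= now + max_ticks; t++)
		if (toy_run_cb_once(clp, cb, t))
			return true;
	return false;
}
```

With a client that only rebinds at tick 3, toy_requeue_only() at
tick 0 fails and leaves the callback pending, while
toy_requeue_and_wait() with a budget of 5 ticks delivers it. A real
fix would presumably use a delayed work item or a wakeup when a new
connection is bound rather than a polling loop.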

> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7325592b456e..ed0e76f7185c 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1191,6 +1191,7 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
>                 case -ETIMEDOUT:
>                 case -EACCES:
>                         nfsd4_mark_cb_down(clp, task->tk_status);
> +                       cb->cb_need_restart = true;
>                 }
>                 break;
>         default:
> 
> Something like this should requeue and retry the callback?

I think we'd need more than just that.

--b.