Re: [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed

Chuck Lever <chuck.lever@xxxxxxxxxx> · Fri, 29 Mar 2024 10:55:54 -0400

On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote:
> 
> On 3/28/24 11:14 AM, Dai Ngo wrote:
> > 
> > On 3/28/24 7:08 AM, Chuck Lever wrote:
> > > On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote:
> > > > On 3/26/24 11:27 AM, Chuck Lever wrote:
> > > > > On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wrote:
> > > > > > > Currently when an nfs4_client is destroyed we wait for the
> > > > > > > cb_recall_any callback to complete before proceeding. This
> > > > > > > adds unnecessary delay to the __destroy_client call if there
> > > > > > > is a problem communicating with the client.
> > > > > By "unnecessary delay" do you mean only the seven-second RPC
> > > > > retransmit timeout, or is there something else?
> > > > When the client network interface is down, the RPC task takes ~9s to
> > > > send the callback, waits for the reply and gets ETIMEDOUT. This process
> > > > repeats in a loop with the same RPC task before being stopped by
> > > > rpc_shutdown_client after client lease expires.
> > > I'll have to review this code again, but rpc_shutdown_client
> > > should cause these RPCs to terminate immediately and safely. Can't
> > > we use that?
> > 
> > rpc_shutdown_client works; it terminated the RPC call and stopped the loop.
> > 
> > > 
> > > 
> > > > It takes a total of about 1m20s before the CB_RECALL is terminated.
> > > > For CB_RECALL_ANY and CB_OFFLOAD, this process gets into an infinite
> > > > loop since there is no delegation conflict and the client is allowed
> > > > to stay in courtesy state.
> > > > 
> > > > The loop happens because in nfsd4_cb_sequence_done if cb_seq_status
> > > > is 1 (an RPC Reply was never received) it calls nfsd4_mark_cb_fault
> > > > to set the NFSD4_CB_FAULT bit. It then sets cb_need_restart to true.
> > > > When nfsd4_cb_release is called, it checks the cb_need_restart flag
> > > > and re-queues the work.
> > > Something in the sequence_done path should check if the server is
> > > tearing down this callback connection. If it doesn't, that is a bug
> > > IMO.
> 
> TCP terminated the connection after retrying for 16 minutes and
> notified the RPC layer which deleted the nfsd4_conn.

The server should have closed this connection already. Is it stuck
waiting for the client to respond to a FIN or something?

> But when nfsd4_run_cb_work ran again, it got into the infinite
> loop caused by:
>      /*
>       * XXX: Ideally, we could wait for the client to
>       *      reconnect, but I haven't figured out how
>       *      to do that yet.
>       */
>       nfsd4_queue_cb_delayed(cb, 25);
> 
> which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9-rc1.

The whole paragraph is:

        clnt = clp->cl_cb_client;
        if (!clnt) {
                if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags))
                        nfsd41_destroy_cb(cb);
                else {
                        /*
                         * XXX: Ideally, we could wait for the client to
                         *      reconnect, but I haven't figured out how
                         *      to do that yet.
                         */
                        nfsd4_queue_cb_delayed(cb, 25);
                }
                return;
        }

When there's no rpc_clnt and CB_KILL is set, the callback
operation should be destroyed immediately. CB_KILL is set by
nfsd4_shutdown_callback(). Its only caller is
__destroy_client().
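
To spell out the branch in the excerpt above, here is a minimal
user-space model of the decision. It is a sketch, not kernel code:
struct model_client, enum cb_action, and model_run_cb_work() are
made-up names for illustration, and the two booleans stand in for
clp->cl_cb_client and the NFSD4_CLIENT_CB_KILL bit. Only the order
of the checks is taken from the excerpt.

```c
#include <stdbool.h>

/* Hypothetical outcomes of one pass through the callback work function. */
enum cb_action {
	CB_DESTROY,	/* models nfsd41_destroy_cb(cb) */
	CB_REQUEUE,	/* models nfsd4_queue_cb_delayed(cb, 25) */
	CB_SEND,	/* models proceeding to issue the RPC */
};

/* Hypothetical stand-in for the parts of nfs4_client that matter here. */
struct model_client {
	bool has_rpc_clnt;	/* models clp->cl_cb_client != NULL */
	bool cb_kill;		/* models NFSD4_CLIENT_CB_KILL in cl_flags */
};

/* Mirrors the order of the checks in the quoted excerpt. */
static enum cb_action model_run_cb_work(const struct model_client *clp)
{
	if (!clp->has_rpc_clnt) {
		if (clp->cb_kill)
			return CB_DESTROY;
		/* No rpc_clnt and no kill flag: the work is queued again. */
		return CB_REQUEUE;
	}
	return CB_SEND;
}
```

In this model, the only state that re-queues is "no rpc_clnt and
CB_KILL clear", which is the state the reported loop appears to be
stuck in.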

Why isn't NFSD4_CLIENT_CB_KILL set?

-- 
Chuck Lever