On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote: > > On 3/28/24 11:14 AM, Dai Ngo wrote: > > > > On 3/28/24 7:08 AM, Chuck Lever wrote: > > > On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote: > > > > On 3/26/24 11:27 AM, Chuck Lever wrote: > > > > > On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wrote: > > > > > > Currently when a nfs4_client is destroyed we wait for > > > > > > the cb_recall_any > > > > > > callback to complete before proceed. This adds > > > > > > unnecessary delay to the > > > > > > __destroy_client call if there is problem communicating > > > > > > with the client. > > > > > By "unnecessary delay" do you mean only the seven-second RPC > > > > > retransmit timeout, or is there something else? > > > > when the client network interface is down, the RPC task takes ~9s to > > > > send the callback, waits for the reply and gets ETIMEDOUT. This process > > > > repeats in a loop with the same RPC task before being stopped by > > > > rpc_shutdown_client after client lease expires. > > > I'll have to review this code again, but rpc_shutdown_client > > > should cause these RPCs to terminate immediately and safely. Can't > > > we use that? > > > > rpc_shutdown_client works, it terminated the RPC call to stop the loop. > > > > > > > > > > > > It takes a total of about 1m20s before the CB_RECALL is terminated. > > > > For CB_RECALL_ANY and CB_OFFLOAD, this process gets in to a infinite > > > > loop since there is no delegation conflict and the client is allowed > > > > to stay in courtesy state. > > > > > > > > The loop happens because in nfsd4_cb_sequence_done if cb_seq_status > > > > is 1 (an RPC Reply was never received) it calls nfsd4_mark_cb_fault > > > > to set the NFSD4_CB_FAULT bit. It then sets cb_need_restart to true. > > > > When nfsd4_cb_release is called, it checks cb_need_restart bit and > > > > re-queues the work again. > > > Something in the sequence_done path should check if the server is > > > tearing down this callback connection. If it doesn't, that is a bug > > > IMO. > > TCP terminated the connection after retrying for 16 minutes and > notified the RPC layer which deleted the nfsd4_conn. The server should have closed this connection already. Is it stuck waiting for the client to respond to a FIN or something? > But when nfsd4_run_cb_work ran again, it got into the infinite > loop caused by: > /* > * XXX: Ideally, we could wait for the client to > * reconnect, but I haven't figured out how > * to do that yet. > */ > nfsd4_queue_cb_delayed(cb, 25); > > which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9-rc1. The whole paragraph is: 1503 clnt = clp->cl_cb_client; 1504 if (!clnt) { 1505 if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) 1506 nfsd41_destroy_cb(cb); 1507 else { 1508 /* 1509 * XXX: Ideally, we could wait for the client to 1510 * reconnect, but I haven't figured out how 1511 * to do that yet. 1512 */ 1513 nfsd4_queue_cb_delayed(cb, 25); 1514 } 1515 return; 1516 } When there's no rpc_clnt and CB_KILL is set, the callback operation should be destroyed immediately. CB_KILL is set by nfsd4_shutdown_callback. It's only caller is __destroy_client(). Why isn't NFSD4_CLIENT_CB_KILL set? -- Chuck Lever