> On Jan 25, 2024, at 5:09 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: > > On 25 Jan 2024, at 11:28, Chuck Lever wrote: > >> The first three patches fix bugs that prevent NFSD's backchannel >> from reliably retransmitting after a client reconnects. These fixes >> might be appropriate for 6.8-rc. >> >> Following that are some new trace points that might be helpful for >> field troubleshooting. >> >> Then there are some minor clean-ups. >> >> I am still testing this series, and there is one msleep() call that >> needs some thought. Thoughts, comments, opinions, rotten fruit? You >> know the drill. >> >> --- >> >> Chuck Lever (13): >> NFSD: Reset cb_seq_status after NFS4ERR_DELAY >> NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down >> NFSD: Retransmit callbacks after client reconnects >> NFSD: Add nfsd_seq4_status trace event >> NFSD: Replace dprintks in nfsd4_cb_sequence_done() >> NFSD: Rename nfsd_cb_state trace point >> NFSD: Add callback operation lifetime trace points >> SUNRPC: Remove EXPORT_SYMBOL_GPL for svc_process_bc() >> NFSD: Remove unused @reason argument >> NFSD: Replace comment with lockdep assertion >> NFSD: Remove BUG_ON in nfsd4_process_cb_update() >> SUNRPC: Remove stale comments >> NFSD: Remove redundant cb_seq_status initialization >> >> >> fs/nfsd/nfs4callback.c | 81 +++++++++++++------- >> fs/nfsd/nfs4state.c | 1 + >> fs/nfsd/trace.h | 162 ++++++++++++++++++++++++++++++++++++++- >> include/trace/misc/nfs.h | 34 ++++++++ >> net/sunrpc/svc.c | 1 - >> net/sunrpc/xprtsock.c | 9 --- >> 6 files changed, 250 insertions(+), 38 deletions(-) > > > These are great, looking forward to see how 02/13 waits for reconnection. > Seems like a wait_on_bit or wait_on_var triggered from nfsd4_init_conn() > would do, but that's just my wild speculation. Thanks for having a look! Well Jeff pointed out that we don't want to put a workqueue worker thread to sleep, so wait_on_* would work but is not an appropriate long-term solution. I'd rather prefer to have a mechanism where the callback RPC tasks can be "hard" and then they would wait in the RPC client for the reconnect. The issue then is how connection establishment is indicated so that the waiting tasks are woken at the right time. -- Chuck Lever