We are working on a prototype of NFS v4.1 server migration, and during testing we are seeing the following NFS client crashes while traffic is moved from Server-1 to Server-2. This was tested on Ubuntu with the 5.15, 6.5 and 6.8 Linux kernels, and in each case the client crashes while accessing invalid xprt structures. Can you please take a look at the issue and my findings so far and suggest next steps? Please also let me know if you need any additional information.

1. Crash call stack:

Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253541] RIP: 0010:xprt_reserve+0x3c/0xd0 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253601] Code: bf b8 00 00 00 00 c7 47 04 00 00 00 00 4c 8b af a8 00 00 00 74 0d 5b 41 5c 41 5d 41 5e 5d c3 cc cc cc cc c7 47 04 f5 ff ff ff <49> 8b 85 08 04 00 00 49 89 fc f6 c4 02 75 28 49 8b 45 08 4c 89 e6
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253603] RSP: 0018:ff5d77a18d147b58 EFLAGS: 00010246
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253606] RAX: 0000000000000000 RBX: ff1ef854fc4b0000 RCX: 0000000000000000
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253607] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff1ef854d2191d00
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253608] RBP: ff5d77a18d147b78 R08: ff1ef854d2191d30 R09: ffffffff8fa071a0
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253609] R10: ff1ef854d2191d00 R11: 0000000000000076 R12: ff1ef854d2191d00
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253611] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffc0d2e5e0
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253613] FS: 0000000000000000(0000) GS:ff1ef8586fc40000(0000) knlGS:0000000000000000
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253617] CR2: 0000000000000408 CR3: 00000003ce43a005 CR4: 0000000000371ee0
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253618] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253619] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253620] Call Trace:
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253621] <TASK>
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253623] ? show_regs+0x6a/0x80
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253628] ? __die+0x25/0x70
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253630] ? page_fault_oops+0x79/0x180
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253634] ? do_user_addr_fault+0x320/0x660
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253636] ? vsnprintf+0x37d/0x550
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253640] ? exc_page_fault+0x74/0x160
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253644] ? asm_exc_page_fault+0x27/0x30
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253648] ? __pfx_call_reserve+0x10/0x10 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253688] ? xprt_reserve+0x3c/0xd0 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253730] call_reserve+0x1d/0x30 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253773] __rpc_execute+0xc1/0x2e0 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253830] rpc_execute+0xbb/0xf0 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253885] rpc_run_task+0x12e/0x190 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253926] rpc_call_sync+0x51/0xb0 [sunrpc]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253968] _nfs4_proc_create_session+0x17a/0x370 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254020] ? vprintk_default+0x1d/0x30
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254024] ? vprintk+0x3c/0x70
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254026] ? _printk+0x58/0x80
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254029] nfs4_proc_create_session+0x67/0x130 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254065] ? nfs4_proc_create_session+0x67/0x130 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254100] ? __pfx_nfs4_test_session_trunk+0x10/0x10 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254133] nfs4_reset_session+0xb2/0x1a0 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254170] nfs4_state_manager+0x3cc/0x950 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254206] ? kernel_sigaction+0x79/0x110
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254209] ? __pfx_nfs4_run_state_manager+0x10/0x10 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254244] nfs4_run_state_manager+0x64/0x170 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254280] ? __pfx_nfs4_run_state_manager+0x10/0x10 [nfsv4]
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254314] kthread+0xd6/0x100
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254317] ? __pfx_kthread+0x10/0x10
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254320] ret_from_fork+0x3c/0x60
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254325] ? __pfx_kthread+0x10/0x10
Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254328] ret_from_fork_asm

I looked at this crash and noticed the following:

During NFS migration, once traffic starts going to Server-2, we do see BAD SESSION (NFS4ERR_BADSESSION) errors. The reason for the bad session is that the client sends the old session ID to Server-2 for some of the requests, even though that session was destroyed on Server-1, along with its client ID, only a few hundred milliseconds earlier. (I am not sure whether some caching is playing a role here or whether session draining is not happening cleanly.)

The crash happens on the client because, in response to the BAD SESSION error from the server, the client attempts session recovery and fails to get a transport for the RPC request, resulting in a crash (accessing already-freed structures?).

Snippet from the trace-cmd logs. The client ID was released here:

 lxfilesstress-3416 [005] 5154.980981: rpc_clnt_shutdown: client=00000000
 lxfilesstress-3416 [005] 5154.980982: rpc_clnt_release: client=00000000

Just a few ms later, the same client ID is used for the session recovery operation, even though it was freed above:

 20.150.9ß/I\^L;l-22349 [006] 5155.042037: rpc_request: task:00000011@00000000 nfsv4 CREATE_SESSION (sync)
 20.150.9ß/I\^L;l-22349 [006] 5155.042040: rpc_task_run_action: task:00000011@00000000 flags=DYNAMIC|NO_ROUND_ROBIN|SOFT|TIMEOUT|NORTO|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserve ------> crash after this in call_reserve

Client code (net/sunrpc/xprt.c):

void xprt_reserve(struct rpc_task *task)
{
	struct rpc_xprt *xprt = task->tk_xprt;   <<<<<<<<<<<<<<<< tk_xprt pointer is NULL

	task->tk_status = 0;
	if (task->tk_rqstp != NULL)
		return;

	task->tk_status = -EAGAIN;
	if (!xprt_throttle_congested(xprt, task)) {
		xprt_do_reserve(xprt, task);
	}
}

tk_xprt is NULL in the above function because, during the session-recovery RPC request, the sunrpc module tries to get an xprt for the client via rpc_task_get_first_xprt(clnt), but this function returns NULL.
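To make the failure less catastrophic while we debug the lifetime issue, I sketched a defensive check in xprt_reserve() that fails the task instead of dereferencing the NULL transport. This is only an illustration for discussion; the placement and the choice of -EIO are my assumptions, not a proposed fix:

void xprt_reserve(struct rpc_task *task)
{
	struct rpc_xprt *xprt = task->tk_xprt;

	task->tk_status = 0;
	if (task->tk_rqstp != NULL)
		return;

	/* Sketch only: if the task was never bound to a transport
	 * (tk_xprt == NULL), fail the RPC with an error instead of
	 * crashing. The -EIO error code is an assumption. */
	if (xprt == NULL) {
		task->tk_status = -EIO;
		return;
	}

	task->tk_status = -EAGAIN;
	if (!xprt_throttle_congested(xprt, task))
		xprt_do_reserve(xprt, task);
}

Of course this only masks the symptom; the underlying question is why rpc_task_get_first_xprt() can return NULL for a client that the state manager is still using.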
For reference, the transport is assigned in rpc_task_set_transport() (net/sunrpc/clnt.c):

static void rpc_task_set_transport(struct rpc_task *task, struct rpc_clnt *clnt)
{
	if (task->tk_xprt) {
		if (!(test_bit(XPRT_OFFLINE, &task->tk_xprt->state) &&
		      (task->tk_flags & RPC_TASK_MOVEABLE)))
			return;
		xprt_release(task);
		xprt_put(task->tk_xprt);
	}
	if (task->tk_flags & RPC_TASK_NO_ROUND_ROBIN)
		task->tk_xprt = rpc_task_get_first_xprt(clnt);   <------- returns NULL
	else
		task->tk_xprt = rpc_task_get_next_xprt(clnt);
}

We also noticed that in nfs4_update_server(), after the nfs4_set_client() call, we do not call nfs_init_server_rpcclient(), whereas every other nfs4_set_client() call site follows it with nfs_init_server_rpcclient(). Can you please let me know whether this is expected? (I have appended a rough sketch of what I expected at the end of this mail.)

2. Another crash call stack, again with the xprt structure being NULL, just after migration:

PID: 306169  TASK: ff4a4584198c8000  CPU: 6  COMMAND: "kworker/u16:0"
 #0 [ff5630b9cb397a10] machine_kexec at ffffffffb9e95ac2
 #1 [ff5630b9cb397a70] __crash_kexec at ffffffffb9fe022f
 #2 [ff5630b9cb397b38] panic at ffffffffb9ed62ce
 #3 [ff5630b9cb397bb8] oops_end at ffffffffb9e405a7
 #4 [ff5630b9cb397be0] page_fault_oops at ffffffffb9eaa482
 #5 [ff5630b9cb397c68] do_user_addr_fault at ffffffffb9eab090
 #6 [ff5630b9cb397cb0] exc_page_fault at ffffffffbadb1884
 #7 [ff5630b9cb397ce0] asm_exc_page_fault at ffffffffbae00bc7
    [exception RIP: xprt_release+317]
    RIP: ffffffffc0522afd  RSP: ff5630b9cb397d98  RFLAGS: 00010286
    RAX: 0000000000000045  RBX: ff4a4584198c8000  RCX: 0000000000000027
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ff4a45840a573d00
    RBP: ff5630b9cb397db8  R8:  000000007867a738  R9:  0000000000000020
    R10: 0000000000ffff10  R11: 000000000000000f  R12: ff4a45840a573d00
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ff5630b9cb397dc0] rpc_release_resources_task at ffffffffc0533593 [sunrpc]
 #9 [ff5630b9cb397dd8] __rpc_execute at ffffffffc053d3ff [sunrpc]
#10 [ff5630b9cb397e38] rpc_async_schedule at ffffffffc053d830 [sunrpc]
#11 [ff5630b9cb397e58] process_one_work at ffffffffb9efd14e
#12 [ff5630b9cb397ea0] worker_thread at ffffffffb9efd3a0
#13 [ff5630b9cb397ee8] kthread at ffffffffb9f06dd6
#14 [ff5630b9cb397f28] ret_from_fork at ffffffffb9e4c62c
#15 [ff5630b9cb397f50] ret_from_fork_asm at ffffffffb9e038db

Thanks,
Bharath
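P.S. To make the nfs4_update_server() question above concrete, here is roughly the call I had expected after nfs4_set_client() succeeds, modeled on the other nfs4_set_client() call sites. This is untested, the argument choices are my assumption, and the error label is hypothetical, so please treat it purely as an illustration of the question:

	/* Sketch only (untested, arguments are my assumption): after
	 * nfs4_set_client() has attached the nfs_server to the new
	 * nfs_client, rebuild the per-superblock RPC client so that
	 * server->client is cloned from the new client's transport
	 * rather than keeping state that points at the old one. */
	error = nfs_init_server_rpcclient(server,
					  server->client->cl_timeout,
					  server->client->cl_auth->au_flavor);
	if (error < 0)
		goto out;	/* "out" label is hypothetical */

If skipping nfs_init_server_rpcclient() here is intentional, I would like to understand how server->client is expected to pick up the new nfs_client's transports after the migration.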