On Fri, Nov 1, 2024 at 11:40 PM Anna Schumaker <anna.schumaker@xxxxxxxxxx> wrote:
>
> Hi Bharath,
>
> On 10/21/24 2:26 AM, Bharath SM wrote:
> > We are working on a prototype of NFS v4.1 server migration, and during
> > our testing we are noticing the following crashes with NFS clients
> > during the movement of traffic from Server-1 to Server-2.
> > We tested on the Ubuntu distro with the 5.15, 6.5, and 6.8 Linux kernel
> > versions and observed the following crashes while accessing invalid
> > xprt structures. Can you please take a look at the issue and my
> > findings so far and suggest next steps?
> > Also please let me know if you need any additional information.
> >
> > 1. Crash call stack:
> >
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253541] RIP: 0010:xprt_reserve+0x3c/0xd0 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253601] Code: bf b8 00 00 00 00 c7 47 04 00 00 00 00 4c 8b af a8 00 00 00 74 0d 5b 41 5c 41 5d 41 5e 5d c3 cc cc cc cc c7 47 04 f5 ff ff ff <49> 8b 85 08 04 00 00 49 89 fc f6 c4 02 75 28 49 8b 45 08 4c 89 e6
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253603] RSP: 0018:ff5d77a18d147b58 EFLAGS: 00010246
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253606] RAX: 0000000000000000 RBX: ff1ef854fc4b0000 RCX: 0000000000000000
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253607] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff1ef854d2191d00
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253608] RBP: ff5d77a18d147b78 R08: ff1ef854d2191d30 R09: ffffffff8fa071a0
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253609] R10: ff1ef854d2191d00 R11: 0000000000000076 R12: ff1ef854d2191d00
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253611] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffc0d2e5e0
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253613] FS: 0000000000000000(0000) GS:ff1ef8586fc40000(0000) knlGS:0000000000000000
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253617] CR2: 0000000000000408 CR3: 00000003ce43a005 CR4: 0000000000371ee0
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253618] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253619] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253620] Call Trace:
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253621] <TASK>
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253623] ? show_regs+0x6a/0x80
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253628] ? __die+0x25/0x70
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253630] ? page_fault_oops+0x79/0x180
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253634] ? do_user_addr_fault+0x320/0x660
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253636] ? vsnprintf+0x37d/0x550
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253640] ? exc_page_fault+0x74/0x160
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253644] ? asm_exc_page_fault+0x27/0x30
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253648] ? __pfx_call_reserve+0x10/0x10 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253688] ? xprt_reserve+0x3c/0xd0 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253730] call_reserve+0x1d/0x30 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253773] __rpc_execute+0xc1/0x2e0 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253830] rpc_execute+0xbb/0xf0 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253885] rpc_run_task+0x12e/0x190 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253926] rpc_call_sync+0x51/0xb0 [sunrpc]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.253968] _nfs4_proc_create_session+0x17a/0x370 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254020] ? vprintk_default+0x1d/0x30
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254024] ? vprintk+0x3c/0x70
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254026] ? _printk+0x58/0x80
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254029] nfs4_proc_create_session+0x67/0x130 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254065] ? nfs4_proc_create_session+0x67/0x130 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254100] ? __pfx_nfs4_test_session_trunk+0x10/0x10 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254133] nfs4_reset_session+0xb2/0x1a0 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254170] nfs4_state_manager+0x3cc/0x950 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254206] ? kernel_sigaction+0x79/0x110
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254209] ? __pfx_nfs4_run_state_manager+0x10/0x10 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254244] nfs4_run_state_manager+0x64/0x170 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254280] ? __pfx_nfs4_run_state_manager+0x10/0x10 [nfsv4]
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254314] kthread+0xd6/0x100
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254317] ? __pfx_kthread+0x10/0x10
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254320] ret_from_fork+0x3c/0x60
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254325] ? __pfx_kthread+0x10/0x10
> > Jul 8 19:35:38 Ubuntu2204-1 kernel: [ 4453.254328] ret_from_fork_asm
> >
> > I looked at this crash and noticed the following:
> >
> > During NFS migration, when traffic starts going to Server-2, we do see
> > BAD SESSION errors. The reason for the bad session is that the client
> > sends an old session ID to Server-2 for some of the requests, even
> > though that session was destroyed on Server-1, along with its client
> > ID, just a few hundred milliseconds earlier. (Not sure if some caching
> > is playing a role here, or if session draining is not happening
> > cleanly.)
> > The crash happens on the client because, in response to the BAD SESSION
> > error from the server, the client attempts session recovery and fails
> > to get a transport for the rpc request, resulting in a crash (accessing
> > already-freed structures?).
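(An aside that may help interpret the oops quoted above. This is my own
decoding, so please treat it as an assumption rather than verified
analysis; I have not re-checked the struct offsets. The faulting bytes in
the Code: line decode to:

        <49> 8b 85 08 04 00 00        mov    0x408(%r13),%rax

With R13 = 0000000000000000 and CR2 = 0000000000000408, this looks like a
read through a NULL xprt pointer at offset 0x408, which would be
consistent with tk_xprt being NULL as described in the "Client code"
section below.)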
> >
> > Snippet from trace-cmd logs:
> >
> > The client ID was being released here:
> > lxfilesstress-3416 [005] 5154.980981: rpc_clnt_shutdown: client=00000000
> > lxfilesstress-3416 [005] 5154.980982: rpc_clnt_release: client=00000000
> >
> > Just a few ms later, the same client ID is being used here for the
> > session recovery operation, after it was freed earlier:
> > 20.150.9ß/I\^L;l-22349 [006] 5155.042037: rpc_request: task:00000011@00000000 nfsv4 CREATE_SESSION (sync)
> > 20.150.9ß/I\^L;l-22349 [006] 5155.042040: rpc_task_run_action: task:00000011@00000000 flags=DYNAMIC|NO_ROUND_ROBIN|SOFT|TIMEOUT|NORTO|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserve ------> Crash after this in call_reserve.
>
> So, the first thing that's sticking out to me in looking through
> nfs4state.c is that we're setting, testing, and clearing the various
> state recovery bits outside of a spinlock (and taking the lock right
> after). This seems like it could potentially open up races, which could
> be what you're seeing (how often does the crash happen for you, is it
> every time or only sometimes?).
>

Hi Anna,

Thanks for your response and suggestions. The issue does not happen every
time, but it can be reproduced consistently with stress tests / high load
during migration. Sure, I will try taking the spinlock in nfs4state.c
before accessing the 'clp->cl_state' recovery bits, then try to reproduce
the issue and update here (a rough sketch of what I plan to try is
further below, after the quoted code).

> >
> > Client code:
> >
> > void xprt_reserve(struct rpc_task *task)
> > {
> >         struct rpc_xprt *xprt = task->tk_xprt;
> >         --------------------<<<<<<<<<<<<<<<< tk_xprt pointer is NULL
> >
> >         task->tk_status = 0;
> >         if (task->tk_rqstp != NULL)
> >                 return;
> >
> >         task->tk_status = -EAGAIN;
> >         if (!xprt_throttle_congested(xprt, task)) {
> >                 xprt_do_reserve(xprt, task);
> >         }
> > }
> >
> > tk_xprt is NULL in the above function because, during the session
> > recovery rpc request, the sunrpc module tries to get an xprt for the
> > given client using 'rpc_task_get_first_xprt(clnt)', but this function
> > returns NULL.
> >
> > static void rpc_task_set_transport(struct rpc_task *task, struct rpc_clnt *clnt)
> > {
> >         if (task->tk_xprt) {
> >                 if (!(test_bit(XPRT_OFFLINE, &task->tk_xprt->state) &&
> >                       (task->tk_flags & RPC_TASK_MOVEABLE)))
> >                         return;
> >                 xprt_release(task);
> >                 xprt_put(task->tk_xprt);
> >         }
> >
> >         if (task->tk_flags & RPC_TASK_NO_ROUND_ROBIN)
> >                 task->tk_xprt = rpc_task_get_first_xprt(clnt);
> >                 ------------------------------- Returns NULL
> >         else
> >                 task->tk_xprt = rpc_task_get_next_xprt(clnt);
> > }
> >
> > We also noticed that in nfs4_update_server(), after the
> > nfs4_set_client() call, we aren't calling nfs_init_server_rpcclient(),
> > whereas in all other cases we always call nfs_init_server_rpcclient()
> > after nfs4_set_client(). Can you please let me know if this is
> > expected?
>
> From my reading of the code, this is expected. nfs4_update_server() uses
> rpc_switch_client_transport() instead to swap out the transport for the
> underlying rpc_client. This lets us reuse the current rpc_client, rather
> than creating a new one with nfs_init_server_rpcclient().
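Coming back to the spinlock suggestion above: this is roughly the kind of
change I plan to experiment with. It is only an untested sketch of the
idea, not a proposed patch. It assumes the existing clp->cl_lock spinlock
in struct nfs_client is the lock in question, the local 'reset_session'
variable is my own, and it uses the session-reset case in
nfs4_state_manager() as the example (other checks done by the real
function are omitted here):

        /* Illustrative sketch only: serialize the state-recovery bit
         * check under clp->cl_lock instead of relying on the atomic
         * bitop alone.  nfs4_reset_session() can sleep, so the lock is
         * dropped before calling it. */
        bool reset_session;

        spin_lock(&clp->cl_lock);
        reset_session = test_and_clear_bit(NFS4CLNT_SESSION_RESET,
                                           &clp->cl_state);
        spin_unlock(&clp->cl_lock);

        if (reset_session) {
                section = "reset session";
                status = nfs4_reset_session(clp);
                if (status < 0)
                        goto out_error;
        }

I will report back whether serializing the test/set/clear of the recovery
bits this way makes any difference under the same stress test.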
> >
> > 2. Another crash call stack, with xprt structures being NULL just
> > after migration:
> >
> > PID: 306169 TASK: ff4a4584198c8000 CPU: 6 COMMAND: "kworker/u16:0"
> > #0 [ff5630b9cb397a10] machine_kexec at ffffffffb9e95ac2
> > #1 [ff5630b9cb397a70] __crash_kexec at ffffffffb9fe022f
> > #2 [ff5630b9cb397b38] panic at ffffffffb9ed62ce
> > #3 [ff5630b9cb397bb8] oops_end at ffffffffb9e405a7
> > #4 [ff5630b9cb397be0] page_fault_oops at ffffffffb9eaa482
> > #5 [ff5630b9cb397c68] do_user_addr_fault at ffffffffb9eab090
> > #6 [ff5630b9cb397cb0] exc_page_fault at ffffffffbadb1884
> > #7 [ff5630b9cb397ce0] asm_exc_page_fault at ffffffffbae00bc7
> >    [exception RIP: xprt_release+317]
> >    RIP: ffffffffc0522afd RSP: ff5630b9cb397d98 RFLAGS: 00010286
> >    RAX: 0000000000000045 RBX: ff4a4584198c8000 RCX: 0000000000000027
> >    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ff4a45840a573d00
> >    RBP: ff5630b9cb397db8 R8: 000000007867a738 R9: 0000000000000020
> >    R10: 0000000000ffff10 R11: 000000000000000f R12: ff4a45840a573d00
> >    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> >    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > #8 [ff5630b9cb397dc0] rpc_release_resources_task at ffffffffc0533593 [sunrpc]
> > #9 [ff5630b9cb397dd8] __rpc_execute at ffffffffc053d3ff [sunrpc]
> > #10 [ff5630b9cb397e38] rpc_async_schedule at ffffffffc053d830 [sunrpc]
> > #11 [ff5630b9cb397e58] process_one_work at ffffffffb9efd14e
> > #12 [ff5630b9cb397ea0] worker_thread at ffffffffb9efd3a0
> > #13 [ff5630b9cb397ee8] kthread at ffffffffb9f06dd6
> > #14 [ff5630b9cb397f28] ret_from_fork at ffffffffb9e4c62c
> > #15 [ff5630b9cb397f50] ret_from_fork_asm at ffffffffb9e038db
>
> I would be curious to hear if doing the test / set / clear bit
> operations mentioned above inside the spin lock region makes a
> difference here.
>

Sure, I will test and update the results here.

> I hope this helps!
> Anna
>
> >
> > Thanks,
> > Bharath
> >