On Thu, May 18, 2023 at 01:14:55PM +0800, Wen Gu wrote: > We found a crash when using SMCRv2 with 2 Mellanox ConnectX-4. It > can be reproduced by: > > - smc_run nginx > - smc_run wrk -t 32 -c 500 -d 30 http://<ip>:<port> > > BUG: kernel NULL pointer dereference, address: 0000000000000014 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 8000000108713067 P4D 8000000108713067 PUD 151127067 PMD 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 4 PID: 2441 Comm: kworker/4:249 Kdump: loaded Tainted: G W E 6.4.0-rc1+ #42 > Workqueue: smc_hs_wq smc_listen_work [smc] > RIP: 0010:smc_clc_send_confirm_accept+0x284/0x580 [smc] > RSP: 0018:ffffb8294b2d7c78 EFLAGS: 00010a06 > RAX: ffff8f1873238880 RBX: ffffb8294b2d7dc8 RCX: 0000000000000000 > RDX: 00000000000000b4 RSI: 0000000000000001 RDI: 0000000000b40c00 > RBP: ffffb8294b2d7db8 R08: ffff8f1815c5860c R09: 0000000000000000 > R10: 0000000000000400 R11: 0000000000000000 R12: ffff8f1846f56180 > R13: ffff8f1815c5860c R14: 0000000000000001 R15: 0000000000000001 > FS: 0000000000000000(0000) GS:ffff8f1aefd00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000014 CR3: 00000001027a0001 CR4: 00000000003706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > ? mlx5_ib_map_mr_sg+0xa1/0xd0 [mlx5_ib] > ? smcr_buf_map_link+0x24b/0x290 [smc] > ? __smc_buf_create+0x4ee/0x9b0 [smc] > smc_clc_send_accept+0x4c/0xb0 [smc] > smc_listen_work+0x346/0x650 [smc] > ? __schedule+0x279/0x820 > process_one_work+0x1e5/0x3f0 > worker_thread+0x4d/0x2f0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0xe5/0x120 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x2c/0x50 > </TASK> > > During the CLC handshake, server sequentially tries available SMCRv2 > and SMCRv1 devices in smc_listen_work(). > > If an SMCRv2 device is found. SMCv2 based link group and link will be > assigned to the connection. Then assumed that some buffer assignment > errors happen later in the CLC handshake, such as RMB registration > failure, server will give up SMCRv2 and try SMCRv1 device instead. But > the resources assigned to the connection won't be reset. > > When server tries SMCRv1 device, the connection creation process will > be executed again. Since conn->lnk has been assigned when trying SMCRv2, > it will not be set to the correct SMCRv1 link in > smcr_lgr_conn_assign_link(). So in such situation, conn->lgr points to > correct SMCRv1 link group but conn->lnk points to the SMCRv2 link > mistakenly. > > Then in smc_clc_send_confirm_accept(), conn->rmb_desc->mr[link->link_idx] > will be accessed. Since the link->link_idx is not correct, the related > MR may not have been initialized, so crash happens. > > | Try SMCRv2 device first > | |-> conn->lgr: assign existed SMCRv2 link group; > | |-> conn->link: assign existed SMCRv2 link (link_idx may be 1 in SMC_LGR_SYMMETRIC); > | |-> sndbuf & RMB creation fails, quit; > | > | Try SMCRv1 device then > | |-> conn->lgr: create SMCRv1 link group and assign; > | |-> conn->link: keep SMCRv2 link mistakenly; > | |-> sndbuf & RMB creation succeed, only RMB->mr[link_idx = 0] > | initialized. > | > | Then smc_clc_send_confirm_accept() accesses > | conn->rmb_desc->mr[conn->link->link_idx, which is 1], then crash. > v > > This patch tries to fix this by cleaning conn->lnk before assigning > link. In addition, it is better to reset the connection and clean the > resources assigned if trying SMCRv2 failed in buffer creation or > registration. > > Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") > Link: https://lore.kernel.org/r/20220523055056.2078994-1-liuyacan@xxxxxxxxxxxxxxxx/ > Signed-off-by: Wen Gu <guwen@xxxxxxxxxxxxxxxxx> LGTM, thanks for your detailed analysis. Reviewed-by: Tony Lu <tonylu@xxxxxxxxxxxxxxxxx> > --- > net/smc/af_smc.c | 9 +++++++-- > net/smc/smc_core.c | 1 + > 2 files changed, 8 insertions(+), 2 deletions(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index 50c38b6..538e9c6 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -2000,8 +2000,10 @@ static int smc_listen_rdma_init(struct smc_sock *new_smc, > return rc; > > /* create send buffer and rmb */ > - if (smc_buf_create(new_smc, false)) > + if (smc_buf_create(new_smc, false)) { > + smc_conn_abort(new_smc, ini->first_contact_local); > return SMC_CLC_DECL_MEM; > + } > > return 0; > } > @@ -2217,8 +2219,11 @@ static void smc_find_rdma_v2_device_serv(struct smc_sock *new_smc, > smcr_version = ini->smcr_version; > ini->smcr_version = SMC_V2; > rc = smc_listen_rdma_init(new_smc, ini); > - if (!rc) > + if (!rc) { > rc = smc_listen_rdma_reg(new_smc, ini->first_contact_local); > + if (rc) > + smc_conn_abort(new_smc, ini->first_contact_local); > + } > if (!rc) > return; > ini->smcr_version = smcr_version; > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c > index 4543567..3f465fa 100644 > --- a/net/smc/smc_core.c > +++ b/net/smc/smc_core.c > @@ -127,6 +127,7 @@ static int smcr_lgr_conn_assign_link(struct smc_connection *conn, bool first) > int i, j; > > /* do link balancing */ > + conn->lnk = NULL; /* reset conn->lnk first */ > for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) { > struct smc_link *lnk = &conn->lgr->lnk[i]; > > -- > 1.8.3.1