On 22.11.24 08:16, Wen Gu wrote: > We encountered a LGR/link use-after-free issue, which manifested as > the LGR/link refcnt reaching 0 early and entering the clear process, > making resource access unsafe. > > refcount_t: addition on 0; use-after-free. > WARNING: CPU: 14 PID: 107447 at lib/refcount.c:25 refcount_warn_saturate+0x9c/0x140 > Workqueue: events smc_lgr_terminate_work [smc] > Call trace: > refcount_warn_saturate+0x9c/0x140 > __smc_lgr_terminate.part.45+0x2a8/0x370 [smc] > smc_lgr_terminate_work+0x28/0x30 [smc] > process_one_work+0x1b8/0x420 > worker_thread+0x158/0x510 > kthread+0x114/0x118 > > or > > refcount_t: underflow; use-after-free. > WARNING: CPU: 6 PID: 93140 at lib/refcount.c:28 refcount_warn_saturate+0xf0/0x140 > Workqueue: smc_hs_wq smc_listen_work [smc] > Call trace: > refcount_warn_saturate+0xf0/0x140 > smcr_link_put+0x1cc/0x1d8 [smc] > smc_conn_free+0x110/0x1b0 [smc] > smc_conn_abort+0x50/0x60 [smc] > smc_listen_find_device+0x75c/0x790 [smc] > smc_listen_work+0x368/0x8a0 [smc] > process_one_work+0x1b8/0x420 > worker_thread+0x158/0x510 > kthread+0x114/0x118 > > It is caused by repeated release of LGR/link refcnt. One suspect is that > smc_conn_free() is called repeatedly because some smc_conn_free() are not > protected by sock lock. > > Calls under socklock | Calls not under socklock > ------------------------------------------------------- > lock_sock(sk) | smc_conn_abort > smc_conn_free | \- smc_conn_free > \- smcr_link_put | \- smcr_link_put (duplicated) > release_sock(sk) > > So make sure smc_conn_free() is called under the sock lock. > > Fixes: 8cf3f3e42374 ("net/smc: use helper smc_conn_abort() in listen processing") > Co-developed-by: Guangguan Wang <guangguan.wang@xxxxxxxxxxxxxxxxx> > Signed-off-by: Guangguan Wang <guangguan.wang@xxxxxxxxxxxxxxxxx> > Co-developed-by: Kai <KaiShen@xxxxxxxxxxxxxxxxx> > Signed-off-by: Kai <KaiShen@xxxxxxxxxxxxxxxxx> > Signed-off-by: Wen Gu <guwen@xxxxxxxxxxxxxxxxx> > --- > net/smc/af_smc.c | 25 +++++++++++++++++++++---- > 1 file changed, 21 insertions(+), 4 deletions(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index ed6d4d520bc7..e0a7a0151b11 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -973,7 +973,8 @@ static int smc_connect_decline_fallback(struct smc_sock *smc, int reason_code, > return smc_connect_fallback(smc, reason_code); > } > > -static void smc_conn_abort(struct smc_sock *smc, int local_first) > +static void __smc_conn_abort(struct smc_sock *smc, int local_first, > + bool locked) > { > struct smc_connection *conn = &smc->conn; > struct smc_link_group *lgr = conn->lgr; > @@ -982,11 +983,27 @@ static void smc_conn_abort(struct smc_sock *smc, int local_first) > if (smc_conn_lgr_valid(conn)) > lgr_valid = true; > > - smc_conn_free(conn); > + if (!locked) { > + lock_sock(&smc->sk); > + smc_conn_free(conn); > + release_sock(&smc->sk); > + } else { > + smc_conn_free(conn); > + } > if (local_first && lgr_valid) > smc_lgr_cleanup_early(lgr); > } > > +static void smc_conn_abort(struct smc_sock *smc, int local_first) > +{ > + __smc_conn_abort(smc, local_first, false); > +} > + > +static void smc_conn_abort_locked(struct smc_sock *smc, int local_first) > +{ > + __smc_conn_abort(smc, local_first, true); > +} > + > /* check if there is a rdma device available for this connection. */ > /* called for connect and listen */ > static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini) > @@ -1352,7 +1369,7 @@ static int smc_connect_rdma(struct smc_sock *smc, > > return 0; > connect_abort: > - smc_conn_abort(smc, ini->first_contact_local); > + smc_conn_abort_locked(smc, ini->first_contact_local); > mutex_unlock(&smc_client_lgr_pending); > smc->connect_nonblock = 0; > > @@ -1454,7 +1471,7 @@ static int smc_connect_ism(struct smc_sock *smc, > > return 0; > connect_abort: > - smc_conn_abort(smc, ini->first_contact_local); > + smc_conn_abort_locked(smc, ini->first_contact_local); > mutex_unlock(&smc_server_lgr_pending); > smc->connect_nonblock = 0; > I wonder if this can deadlock, when you take lock_sock so far down in the callchain. example: smc_connect will first take lock_sock(sk) and then mutex_lock(&smc_server_lgr_pending); (e.g. in smc_connect_ism()) wheras smc_listen_work() will take mutex_lock(&smc_server_lgr_pending); and then lock_sock(sk) (in your __smc_conn_abort(,,false)) I am not sure whether this can be called on the same socket, but it looks suspicious to me. All callers of smc_conn_abort() without socklock seem to originate from smc_listen_work(). That makes me think whether smc_listen_work() should do lock_sock(sk) on a higher level. Do you have an example which function could collide with smc_listen_work()? i.e. have you found a way to reproduce this? Are you sure that all callers of smc_conn_free(), that are not smc_conn_abort(), do set the socklock? It seems to me that the path of smc_conn_kill() is not covered by your solution. Please excuse, that I am not deeply familiar with this code. I'm just trying to ask helpful questions.