On 17/04/2023 11:08, Guoqing Jiang wrote: > > > On 4/14/23 18:09, Zhijian Li (Fujitsu) wrote: >> >> On 14/04/2023 14:04, Guoqing Jiang wrote: >>> >>> On 4/14/23 13:37, Zhijian Li (Fujitsu) wrote: >>>> On 14/04/2023 11:40, Guoqing Jiang wrote: >>>>> On 4/13/23 16:12, Zhijian Li (Fujitsu) wrote: >>>>>> On 13/04/2023 15:35, Guoqing Jiang wrote: >>>>>>> Hi, >>>>>>> >>>>>>> I take a closer look today. >>>>>>> >>>>>>> On 4/12/23 09:15, Zhijian Li (Fujitsu) wrote: >>>>>>>> On 11/04/2023 20:26, Leon Romanovsky wrote: >>>>>>>>> On Tue, Apr 11, 2023 at 02:43:46AM +0000, Zhijian Li (Fujitsu) wrote: >>>>>>>>>> On 10/04/2023 21:10, Guoqing Jiang wrote: >>>>>>>>>>> On 4/10/23 20:08, Leon Romanovsky wrote: >>>>>>>>>>>> On Mon, Apr 10, 2023 at 06:43:03AM +0000, Li Zhijian wrote: >>>>>>>>>>>>> The warning occurs when destroying PD whose reference count is not zero. >>>>>>>>>>>>> >>>>>>>>>>>>> Precodition: clt_path->s.con_num is 2. >>>>>>>>>>>>> So 2 cm connection will be created as below: >>>>>>>>>>>>> CPU0 CPU1 >>>>>>>>>>>>> init_conns { | >>>>>>>>>>>>> create_cm() // a. con[0] created | >>>>>>>>>>>>> | a'. rtrs_clt_rdma_cm_handler() { >>>>>>>>>>>>> | rtrs_rdma_addr_resolved() >>>>>>>>>>>>> | create_con_cq_qp(con); << con[0] >>>>>>>>>>>>> | } >>>>>>>>>>>>> | in this moment, refcnt of PD was increased to 2+ >>>>>>> What do you mean "refcnt of PD"? usecnt in struct ib_pd or dev_ref. >>>>>> I mean usecnt in struct ib_pd >>>>>> >>>>>>>>>>>>> | >>>>>>>>>>>>> create_cm() // b. cid = 1, failed | >>>>>>>>>>>>> destroy_con_cq_qp() | >>>>>>>>>>>>> rtrs_ib_dev_put() | >>>>>>>>>>>>> dev_free() | >>>>>>>>>>>>> ib_dealloc_pd(dev->ib_pd) << PD | >>>>>>>>>>>>> is destroyed, but refcnt is | >>>>>>>>>>>>> still greater than 0 | >>>>>>> Assuming you mean "pd->usecnt". We only allocate pd in con[0] by rtrs_ib_dev_find_or_add, >>>>>>> if con[1] failed to create cm, then alloc_path_reqs -> ib_alloc_mr -> atomic_inc(&pd->usecnt) >>>>> The above can't be invoked, right? >>>>> >>>>>>> can't be triggered. Is there other places could increase the refcnt? >>>>>> Yes, when create a qp, it will also associate to this PD, that also mean refcnt of PD will be increased. >>>>>> >>>>>> When con[0](create_con_cq_qp) succeeded, refcnt of PD will be 2. and then when con[1] failed, since >>>>>> QP didn't create, refcnt of PD is still 2. con[1]'s cleanup will destroy the PD(ib_dealloc_pd) since dev_ref = 1, after that its >>>>>> refcnt is still 1. >>>>> I can see the path increase usecnt to 1. >>>>> >>>>> rtrs_cq_qp_create -> create_qp >>>>> -> rdma_create_qp >>>>> -> ib_create_qp >>>>> -> create_qp >>>>> -> ib_qp_usecnt_inc which increases pd->usecnt >>>>> >>>>> Where is another place to increase usecnt to 2? >>>> It should be >>>> ib_create_qp ... >>>> -> rxe_create_qp >>>> -> rxe_qp_from_init >>>> -> rxe_get(pd) <<< pd's refcnt will be increased. >>> Isn't rxe_get just increase elem->ref_cnt? >> Yes, that's true. > > I am confused, does increase ref_cnt equal to increase usecnt? I need to apologize for my mistake. I have been referring to the elem.ref_cnt of the rxe driver as the refcnt of PD. > If not, then where is another place to increase usecnt to 2? > > BTW, I traced with 6.3-rc5, seems pd's usecnt is only increase once > after create one connection. And the warning mentioned above it also pointed to the PD's elem.ref_cnt. > > [ 6941.525088] in init_conns 2353 con_num=3 > [ 6941.525732] in create_con_cq_qp 1648 > [ 6941.525944] in rtrs_cq_qp_create 311 con->cid=0 path->dev->ib_pd->usecnt=1 > [ 6941.532460] in create_con_cq_qp 1648 > [ 6941.532746] in rtrs_cq_qp_create 311 con->cid=1 path->dev->ib_pd->usecnt=2 > [ 6941.533183] in create_con_cq_qp 1648 > [ 6941.533464] in rtrs_cq_qp_create 311 con->cid=2 path->dev->ib_pd->usecnt=3 > [ 6941.533685] in init_conns 2365, clt_path->s.dev->ib_pd->usecnt=3 > [ 6941.535680] in init_conns 2371, clt_path->s.dev->ib_pd->usecnt=515 Thanks Zhijian below is a piece of code that i used to debug this issue. --- a/drivers/infiniband/ulp/rtrs/rtrs-clt.c +++ b/drivers/infiniband/ulp/rtrs/rtrs-clt.c @@ -1730,15 +1730,39 @@ static int create_con_cq_qp(struct rtrs_clt_con *con) return err; } +struct rxe_pool; +struct rxe_pool_elem { + struct rxe_pool *pool; + void *obj; + struct kref ref_cnt; + struct list_head list; + struct completion complete; + u32 index; +}; + +struct rxe_pd { + struct ib_pd ibpd; + struct rxe_pool_elem elem; +}; + +static inline struct rxe_pd *to_rpd(struct ib_pd *pd) +{ + return pd ? container_of(pd, struct rxe_pd, ibpd) : NULL; +} + +#define rxe_read(obj) kref_read(&(obj)->elem.ref_cnt) static void destroy_con_cq_qp(struct rtrs_clt_con *con) { struct rtrs_clt_path *clt_path = to_clt_path(con->c.path); + struct rtrs_ib_dev *dev = clt_path->s.dev; + struct rxe_pd *pd = to_rpd(dev->ib_pd); /* * Be careful here: destroy_con_cq_qp() can be called even * create_con_cq_qp() failed, see comments there. */ lockdep_assert_held(&con->con_mutex); + rtrs_info(clt_path->clt, "%s: clt_path->s.dev_ref: %d, pd %px, ref: %d\n", __func__, clt_path->s.dev_ref, &pd->elem, rxe_read(pd)); rtrs_cq_qp_destroy(&con->c); if (con->rsp_ius) { rtrs_iu_free(con->rsp_ius, clt_path->s.dev->ib_dev, @@ -1746,7 +1770,8 @@ static void destroy_con_cq_qp(struct rtrs_clt_con *con) con->rsp_ius = NULL; con->queue_num = 0; } + rtrs_info(clt_path->clt, "%s: clt_path->s.dev_ref: %d, pd %px, ref: %d\n", __func__, clt_path->s.dev_ref, &pd->elem, rxe_read(pd)); if (clt_path->s.dev_ref && !--clt_path->s.dev_ref) { rtrs_ib_dev_put(clt_path->s.dev); clt_path->s.dev = NULL; > > Thanks, > Guoqing