On Fri, Jan 10, 2025 at 5:09 PM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
>
> The Call Trace is as below:
> "
> <TASK>
> ? show_regs.cold+0x1a/0x1f
> ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
> ? __warn+0x84/0xd0
> ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
> ? report_bug+0x105/0x180
> ? handle_bug+0x46/0x80
> ? exc_invalid_op+0x19/0x70
> ? asm_exc_invalid_op+0x1b/0x20
> ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
> ? __rxe_cleanup+0x124/0x170 [rdma_rxe]
> rxe_destroy_qp.cold+0x24/0x29 [rdma_rxe]
> ib_destroy_qp_user+0x118/0x190 [ib_core]
> rdma_destroy_qp.cold+0x43/0x5e [rdma_cm]
> rtrs_cq_qp_destroy.cold+0x1d/0x2b [rtrs_core]
> rtrs_srv_close_work.cold+0x1b/0x31 [rtrs_server]
> process_one_work+0x21d/0x3f0
> worker_thread+0x4a/0x3c0
> ? process_one_work+0x3f0/0x3f0
> kthread+0xf0/0x120
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x22/0x30
> </TASK>
> "
> When too many rdma resources are allocated, rxe needs more time to
> handle these rdma resources. Sometimes with the current timeout, rxe
> can not release the rdma resources correctly.
>
> Compared with other rdma drivers, a bigger timeout is used.
>
> Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects")
> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>

We tested this patch. All the tests can pass with this patch.
Tested-by: Joe Klein <joe.klein812@xxxxxxxxx>

> ---
> drivers/infiniband/sw/rxe/rxe_pool.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 67567d62195e..d9cb682fd71f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -178,7 +178,6 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>  {
>  	struct rxe_pool *pool = elem->pool;
>  	struct xarray *xa = &pool->xa;
> -	static int timeout = RXE_POOL_TIMEOUT;
>  	int ret, err = 0;
>  	void *xa_ret;
>
> @@ -202,19 +201,19 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>  	 * return to rdma-core
>  	 */
>  	if (sleepable) {
> -		if (!completion_done(&elem->complete) && timeout) {
> +		if (!completion_done(&elem->complete)) {
>  			ret = wait_for_completion_timeout(&elem->complete,
> -					timeout);
> +					msecs_to_jiffies(50000));
>
>  			/* Shouldn't happen. There are still references to
>  			 * the object but, rather than deadlock, free the
>  			 * object or pass back to rdma-core.
>  			 */
>  			if (WARN_ON(!ret))
> -				err = -EINVAL;
> +				err = -ETIMEDOUT;
>  		}
>  	} else {
> -		unsigned long until = jiffies + timeout;
> +		unsigned long until = jiffies + RXE_POOL_TIMEOUT;
>
>  		/* AH objects are unique in that the destroy_ah verb
>  		 * can be called in atomic context. This delay
> @@ -226,7 +225,7 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>  			mdelay(1);
>
>  		if (WARN_ON(!completion_done(&elem->complete)))
> -			err = -EINVAL;
> +			err = -ETIMEDOUT;
>  	}
>
>  	if (pool->cleanup)
> --
> 2.34.1
>
>
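
For readers following the non-sleepable branch, here is a rough userspace sketch of the same pattern: poll a "done" predicate about once per millisecond until a deadline, then report -ETIMEDOUT rather than -EINVAL if it never completed. This is only an illustration, not the rxe code. The names poll_until_done(), done_fn, now_ms() and done_after_n_polls() are hypothetical, and nanosleep() and clock_gettime() stand in for mdelay() and jiffies.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for completion_done(&elem->complete). */
typedef bool (*done_fn)(void *ctx);

/* Monotonic clock in milliseconds; plays the role of jiffies here. */
static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll done() roughly once per millisecond until it reports true or
 * timeout_ms has elapsed. Returns 0 on completion and -ETIMEDOUT if
 * the deadline passed first.
 */
static int poll_until_done(done_fn done, void *ctx, long long timeout_ms)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	long long until = now_ms() + timeout_ms;

	while (!done(ctx) && now_ms() < until)
		nanosleep(&one_ms, NULL);	/* userspace analogue of mdelay(1) */

	return done(ctx) ? 0 : -ETIMEDOUT;
}

/* Example predicate: reports "not done" for the first *ctx polls. */
static bool done_after_n_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

Calling poll_until_done(done_after_n_polls, &n, timeout_ms) with a small n and a generous timeout returns 0. With an n far larger than the number of polls that fit in the timeout, it returns -ETIMEDOUT, which is the distinction the patch makes visible to callers.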