Re: [PATCH 1/1] RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]"

On Fri, Jan 10, 2025 at 5:09 PM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
>
> The Call Trace is as below:
> "
>   <TASK>
>   ? show_regs.cold+0x1a/0x1f
>   ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
>   ? __warn+0x84/0xd0
>   ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
>   ? report_bug+0x105/0x180
>   ? handle_bug+0x46/0x80
>   ? exc_invalid_op+0x19/0x70
>   ? asm_exc_invalid_op+0x1b/0x20
>   ? __rxe_cleanup+0x12c/0x170 [rdma_rxe]
>   ? __rxe_cleanup+0x124/0x170 [rdma_rxe]
>   rxe_destroy_qp.cold+0x24/0x29 [rdma_rxe]
>   ib_destroy_qp_user+0x118/0x190 [ib_core]
>   rdma_destroy_qp.cold+0x43/0x5e [rdma_cm]
>   rtrs_cq_qp_destroy.cold+0x1d/0x2b [rtrs_core]
>   rtrs_srv_close_work.cold+0x1b/0x31 [rtrs_server]
>   process_one_work+0x21d/0x3f0
>   worker_thread+0x4a/0x3c0
>   ? process_one_work+0x3f0/0x3f0
>   kthread+0xf0/0x120
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x22/0x30
>   </TASK>
> "
> When too many RDMA resources are allocated, rxe needs more time to
> handle them. With the current timeout, rxe sometimes cannot release
> the RDMA resources correctly.
>
> Compared with other rdma drivers, a bigger timeout is used.
>
> Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects")
> Signed-off-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>

We tested this patch, and all of our tests pass with it applied.

Tested-by: Joe Klein <joe.klein812@xxxxxxxxx>

> ---
>  drivers/infiniband/sw/rxe/rxe_pool.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 67567d62195e..d9cb682fd71f 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -178,7 +178,6 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>  {
>         struct rxe_pool *pool = elem->pool;
>         struct xarray *xa = &pool->xa;
> -       static int timeout = RXE_POOL_TIMEOUT;
>         int ret, err = 0;
>         void *xa_ret;
>
> @@ -202,19 +201,19 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>          * return to rdma-core
>          */
>         if (sleepable) {
> -               if (!completion_done(&elem->complete) && timeout) {
> +               if (!completion_done(&elem->complete)) {
>                         ret = wait_for_completion_timeout(&elem->complete,
> -                                       timeout);
> +                                       msecs_to_jiffies(50000));
>
>                         /* Shouldn't happen. There are still references to
>                          * the object but, rather than deadlock, free the
>                          * object or pass back to rdma-core.
>                          */
>                         if (WARN_ON(!ret))
> -                               err = -EINVAL;
> +                               err = -ETIMEDOUT;
>                 }
>         } else {
> -               unsigned long until = jiffies + timeout;
> +               unsigned long until = jiffies + RXE_POOL_TIMEOUT;
>
>                 /* AH objects are unique in that the destroy_ah verb
>                  * can be called in atomic context. This delay
> @@ -226,7 +225,7 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>                         mdelay(1);
>
>                 if (WARN_ON(!completion_done(&elem->complete)))
> -                       err = -EINVAL;
> +                       err = -ETIMEDOUT;
>         }
>
>         if (pool->cleanup)
> --
> 2.34.1
>
>
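For anyone wondering how long the new sleepable wait is: the patch passes
msecs_to_jiffies(50000) to wait_for_completion_timeout(), i.e. 50 seconds,
and the number of jiffies that maps to depends on the kernel's HZ. Below is
a minimal user-space sketch of that conversion. The helper name
sketch_msecs_to_jiffies() and the explicit hz parameter are hypothetical and
only for illustration; the in-kernel msecs_to_jiffies() takes a single
argument, uses the build-time HZ, and also handles overflow, which is
omitted here.

```c
#include <stdio.h>

/*
 * Hypothetical user-space stand-in for the kernel's msecs_to_jiffies().
 * Assumption: a plain round-up of ms * hz / 1000, so that a nonzero
 * wait never collapses to zero ticks. Overflow handling is left out.
 */
static unsigned long sketch_msecs_to_jiffies(unsigned long ms,
                                             unsigned long hz)
{
        /* Widen before multiplying so 50000 * hz cannot wrap on 32-bit. */
        unsigned long long ticks =
                ((unsigned long long)ms * hz + 999ULL) / 1000ULL;

        return (unsigned long)ticks;
}

/* Print the 50 s wait from the patch for one assumed HZ value. */
static void sketch_print_timeout(unsigned long hz)
{
        printf("HZ=%lu: 50000 ms -> %lu jiffies\n",
               hz, sketch_msecs_to_jiffies(50000, hz));
}
```

With this sketch, sketch_msecs_to_jiffies(50000, 250) evaluates to 12500 and
sketch_msecs_to_jiffies(50000, 1000) to 50000, so the wait is 50 seconds of
wall-clock time whatever HZ the kernel was built with.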




