Re: [PATCH for-next v8 00/10] RDMA/rxe: Implement memory windows

On 6/4/2021 12:55 PM, Jason Gunthorpe wrote:
On Fri, Jun 04, 2021 at 12:53:51PM -0500, Pearson, Robert B wrote:
On 6/4/2021 11:22 AM, Pearson, Robert B wrote:
On 6/4/2021 12:37 AM, Zhu Yanjun wrote:
I added an rxe device on the netdev and then ran the rdma-core test tools.
Then I removed the rxe device and, finally, unloaded the rdma_rxe kernel
module.
I found the logs below.
"
[ 1249.651921] rdma_rxe: rxe-pd pool destroyed with unfree'd elem
[ 1249.651927] rdma_rxe: rxe-qp pool destroyed with unfree'd elem
[ 1249.651929] rdma_rxe: rxe-cq pool destroyed with unfree'd elem
"

It seems that some resources are leaking.

I will make further investigations.

Zhu Yanjun
Zhu,

I suspect this is an older error. I traced all the add and drop ref
calls for PDs, then ran the full suite of Python tests, and also ran
test_mr (which includes the memory window tests) by itself, and counted
the adds and drops. For test_mr alone I get 85 adds and 85 drops, but when
I run the whole suite I get 384 adds and 380 drops. Since the memory
window code is only exercised in test_mr, I think it is OK. Somewhere
else there are missing drops. I will try to isolate them.
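
The add/drop counting described above can be sketched roughly as follows.
This is only an illustration: the trace line format, the regular expression
and the helper names (count_refs, imbalance) are assumptions, since the
actual trace output was not posted in this thread.

```python
import re
from collections import Counter

# Hypothetical trace format, e.g. "[ 10.000001] rxe-pd add_ref".
# The real trace lines were not shown in the thread.
TRACE_RE = re.compile(r"rxe-(?P<pool>\w+) (?P<op>add_ref|drop_ref)")


def count_refs(lines):
    """Count add_ref/drop_ref events per pool from an iterable of log lines."""
    counts = {}
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            counts.setdefault(m["pool"], Counter())[m["op"]] += 1
    return counts


def imbalance(counts):
    """Return {pool: adds - drops} for pools whose adds and drops differ."""
    return {
        pool: c["add_ref"] - c["drop_ref"]
        for pool, c in counts.items()
        if c["add_ref"] != c["drop_ref"]
    }


# Made-up sample lines in the assumed format.
sample = [
    "[ 10.000001] rxe-pd add_ref",
    "[ 10.000002] rxe-pd add_ref",
    "[ 10.000003] rxe-pd drop_ref",
    "[ 10.000004] rxe-qp add_ref",
    "[ 10.000005] rxe-qp drop_ref",
]
print(imbalance(count_refs(sample)))
```

Running the same count once per test case, rather than once for the whole
suite, is one way to narrow down which test leaves references behind.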

Bob

Zhu,

In rdma_core/tests/test_qpex.py test_qp_ex_rc_atomic_cmp_swp and
test_qp_ex_rc_atomic_fetch_add each have two missing drops of PDs. This is
either a test bug or a bug in the rxe driver but it has nothing to do with
the MW code. We should treat it as a separate error. For some reason these
test cases are not cleaning up all resources.

The cleanup code in all these Python tests is very implicit. It just happens
by magic so it is hard to figure out where an ibv_destroy_qp or
ibv_destroy_cq went missing. It would help if someone who is familiar with
these tests could look at it.
It is impossible for userspace to leak a kernel resource: when the fd
is closed, the kernel guarantees that everything is destroyed back to
the driver.

As long as pyverbs has exited, pyverbs cannot be the bug.

Jason

Thanks. That helped. Adding traces for QP references turned up the
problem. Someone took a reference on the QP in send_atomic_ack() that is
never matched by a drop reference. The logic was probably to protect the
QP from going away while it held an atomic responder resource. The
problem is that, since the requester can retry the operation multiple
times, the responder never knows when to free the resource, so it
doesn't. It just recycles the resources FIFO when a new atomic request
comes along. They are looked up by PSN. So the only logical solution is
to not take the extra reference. If you destroy the QP while a requester
is retrying atomic operations, it just fails.
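
A toy model of the imbalance, for illustration only. This is not the rxe
code: the QP and Responder classes, the slot count and the take_extra_ref
flag are all hypothetical, and only the name send_atomic_ack() comes from
the driver. It shows responder resources being recycled FIFO and looked up
by PSN, and a reference taken per atomic ack that nothing ever drops.

```python
class QP:
    """Toy refcounted object; starts with one reference held by its owner."""

    def __init__(self):
        self.refcount = 1

    def add_ref(self):
        self.refcount += 1

    def drop_ref(self):
        self.refcount -= 1


class Responder:
    """Toy responder keeping atomic results in a fixed-size FIFO ring."""

    def __init__(self, qp, slots, take_extra_ref):
        self.qp = qp
        self.resources = [None] * slots
        self.head = 0
        self.take_extra_ref = take_extra_ref  # hypothetical switch

    def send_atomic_ack(self, psn, result):
        if self.take_extra_ref:
            # Models the unmatched reference: there is no point at which
            # the responder knows the requester has stopped retrying.
            self.qp.add_ref()
        # Overwrite the oldest slot; nothing is freed explicitly.
        self.resources[self.head] = (psn, result)
        self.head = (self.head + 1) % len(self.resources)

    def find_by_psn(self, psn):
        """Return the saved result for a retried request, or None."""
        for res in self.resources:
            if res is not None and res[0] == psn:
                return res[1]
        return None


def leaked_refs(take_extra_ref, n_atomics=2):
    """References left on the QP after the owner drops its own."""
    qp = QP()
    responder = Responder(qp, slots=4, take_extra_ref=take_extra_ref)
    for psn in range(n_atomics):
        responder.send_atomic_ack(psn, result=psn * 10)
    qp.drop_ref()  # owner destroys the QP
    return qp.refcount


print(leaked_refs(take_extra_ref=True), leaked_refs(take_extra_ref=False))
```

In this model, removing the add_ref() call is the only change needed for the
count to reach zero at destroy time.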

I'll submit a patch deleting one line.

Bob



