On 6/4/2021 12:55 PM, Jason Gunthorpe wrote:
On Fri, Jun 04, 2021 at 12:53:51PM -0500, Pearson, Robert B wrote:
On 6/4/2021 11:22 AM, Pearson, Robert B wrote:
On 6/4/2021 12:37 AM, Zhu Yanjun wrote:
After adding an rxe device on the netdev, I ran the rdma-core test tools.
Then I removed the rxe device and, finally, unloaded the rdma_rxe kernel
module.
I found the following logs.
"
[ 1249.651921] rdma_rxe: rxe-pd pool destroyed with unfree'd elem
[ 1249.651927] rdma_rxe: rxe-qp pool destroyed with unfree'd elem
[ 1249.651929] rdma_rxe: rxe-cq pool destroyed with unfree'd elem
"
It seems that some resources are leaking.
I will investigate further.
Zhu Yanjun
Zhu,
I suspect this is an older error. I traced all the add and drop ref
calls for PDs, then ran the full suite of Python tests, and also ran
test_mr (which includes the memory window tests) by itself, and counted
the adds and drops. For test_mr alone I get 85 adds and 85 drops, but
when I run the whole suite I get 384 adds and 380 drops. Since the memory
window code is only exercised in test_mr, I think it is OK. Somewhere
else there are missing drops. I will try to isolate them.
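
A minimal sketch of the counting step, for anyone who wants to repeat it.
It assumes a hypothetical trace format of "<event> <pool> <index>" per
line, with <event> being "add_ref" or "drop_ref"; the format, the function
name tally_refs and the sample data are all made up for illustration and
are not the traces used above.

```python
from collections import Counter


def tally_refs(trace_lines):
    """Return {(pool, index): adds - drops} for objects with a nonzero balance.

    Assumes a hypothetical line format "<event> <pool> <index>",
    where <event> is "add_ref" or "drop_ref".
    """
    balance = Counter()
    for line in trace_lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] not in ("add_ref", "drop_ref"):
            continue  # skip unrelated log lines
        event, pool, index = parts
        balance[(pool, index)] += 1 if event == "add_ref" else -1
    return {key: count for key, count in balance.items() if count != 0}


# Made-up sample trace: pd 1 is balanced, pd 2 is missing one drop.
sample = [
    "add_ref pd 1",
    "add_ref pd 2",
    "add_ref pd 2",
    "drop_ref pd 1",
    "drop_ref pd 2",
]
leaks = tally_refs(sample)
print(leaks)
```

Tallying per object rather than per pool points at which PD is left with
an outstanding reference, not just how many drops are missing overall.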
Bob
Zhu,
In rdma_core/tests/test_qpex.py, test_qp_ex_rc_atomic_cmp_swp and
test_qp_ex_rc_atomic_fetch_add each have two missing drops of PDs. This is
either a test bug or a bug in the rxe driver, but it has nothing to do with
the MW code. We should treat it as a separate error. For some reason these
test cases are not cleaning up all resources.
The cleanup code in all these Python tests is very implicit. It just happens
by magic so it is hard to figure out where an ibv_destroy_qp or
ibv_destroy_cq went missing. It would help if someone who is familiar with
these tests could look at it.
It is impossible for userspace to leak a kernel resource. When the fd
is closed, everything is destroyed back to the driver; this is guaranteed
by the kernel.
As long as pyverbs has exited, pyverbs cannot be the bug.
Jason
Thanks. That helped. Adding traces for QP references turned up the
problem. Someone took a reference on the QP in send_atomic_ack() that is
never matched by a drop reference. The logic was probably to protect the
QP from going away while it held an atomic responder resource. The
problem is that, since the requester can retry the operation multiple
times, the responder never knows when to free the resource, so it doesn't.
It just recycles the resources FIFO when a new atomic request comes along.
They are looked up by PSN. So the only logical solution is to not take the
extra reference. If you destroy the QP while a requester is retrying
atomic operations, the retries just fail.
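
To make the accounting concrete, here is a toy model of the behavior
described above. It is only a sketch and not the rxe driver code: the
class ToyQP, its fields, the ring size and the PSN values are all
hypothetical. It assumes one extra QP reference per atomic request that
is never dropped, and a FIFO ring of saved resources looked up by PSN.

```python
class ToyQP:
    """Toy QP with a reference count and a FIFO ring of atomic
    responder resources. Hypothetical model, not the rxe structures."""

    def __init__(self, max_resources=4, take_extra_ref=True):
        self.refcount = 1                    # reference held by the creator
        self.ring = [None] * max_resources   # each slot holds a PSN or None
        self.head = 0                        # next slot to recycle (FIFO)
        self.take_extra_ref = take_extra_ref

    def handle_atomic(self, psn):
        """Save a resource for a new atomic request in the oldest slot."""
        self.ring[self.head] = psn           # overwrite, never explicitly freed
        if self.take_extra_ref:
            self.refcount += 1               # the unmatched reference
        self.head = (self.head + 1) % len(self.ring)

    def lookup(self, psn):
        """A retried request finds its saved resource by PSN, if still there."""
        return psn in self.ring

    def destroy(self):
        """Drop the creator's reference; True means the QP can be freed."""
        self.refcount -= 1
        return self.refcount == 0


buggy = ToyQP(take_extra_ref=True)
fixed = ToyQP(take_extra_ref=False)
for psn in range(100, 106):                  # six atomics, four slots
    buggy.handle_atomic(psn)
    fixed.handle_atomic(psn)

buggy_freed = buggy.destroy()
fixed_freed = fixed.destroy()
print(buggy_freed, fixed_freed)
```

In this model the QP that takes the extra reference can never reach a
zero count at destroy time, while the one that does not is freed and
still answers lookups by PSN for the most recent requests.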
I'll submit a patch deleting one line.
Bob