On 6/7/2021 10:48 PM, Zhu Yanjun wrote:
On Tue, Jun 8, 2021 at 10:01 AM Pearson, Robert B <rpearsonhpe@xxxxxxxxx> wrote:
On 6/7/2021 8:39 PM, Zhu Yanjun wrote:
On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@xxxxxxxxx> wrote:
On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
Currently the rdma_rxe driver attempts to protect atomic responder
resources by taking a reference to the qp which is only freed when the
resource is recycled for a new read or atomic operation. This means that
in normal circumstances there is almost always an extra qp reference
once an atomic operation has been executed which prevents cleaning up
the qp and associated pd and cqs when the qp is destroyed.
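
The reference pattern described above can be sketched as a small stand-alone model. This is only an illustration under stated assumptions: the `model_*` names and structs are hypothetical and are not the rxe driver's code. It contrasts a responder resource that pins the qp until it is recycled with one that takes no reference, which is the removal discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's objects; not the rxe structs. */
struct model_qp {
    int refcnt;   /* starts at 1 for the owner's reference */
    bool freed;   /* set once the last reference is dropped */
};

struct model_resp_res {
    struct model_qp *qp;   /* non-NULL while the resource pins the qp */
};

static void model_add_ref(struct model_qp *qp)
{
    qp->refcnt++;
}

static void model_drop_ref(struct model_qp *qp)
{
    assert(qp->refcnt > 0);
    if (--qp->refcnt == 0)
        qp->freed = true;
}

/* Old pattern: the pin is released only when the resource is recycled. */
static void model_atomic_ack_old(struct model_qp *qp,
                                 struct model_resp_res *res)
{
    if (res->qp)
        model_drop_ref(res->qp);   /* recycle the previous use */
    model_add_ref(qp);             /* pin the qp for this atomic */
    res->qp = qp;
}

/* Pattern without the extra reference: the resource never pins the qp. */
static void model_atomic_ack_new(struct model_qp *qp,
                                 struct model_resp_res *res)
{
    (void)qp;
    res->qp = NULL;
}

/* Run one atomic, then destroy the qp; report whether it was freed. */
static bool model_destroy_frees_qp(bool without_extra_ref)
{
    struct model_qp qp = { .refcnt = 1, .freed = false };
    struct model_resp_res res = { .qp = NULL };

    if (without_extra_ref)
        model_atomic_ack_new(&qp, &res);
    else
        model_atomic_ack_old(&qp, &res);

    model_drop_ref(&qp);   /* destroy drops the owner's reference */
    return qp.freed;
}
```

In this model, `model_destroy_frees_qp(false)` leaves the qp allocated because the resource still holds its pin, while `model_destroy_frees_qp(true)` frees it.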
This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
I am not sure that removing the call to rxe_add_ref is a good way to
fix this problem.
The reference to the qp is taken to protect the atomic responder resources,
so removing rxe_add_ref weakens the protection of those
resources.
All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It would be good to delete them all.
I made tests with this commit. After this commit is applied, this
problem disappeared.
You were testing MW when you saw this bug. Does that mean that now MW is
working for you?
Your MW patches are huge. After these patches are applied, I found 2
problems in my test environment.
The trace you showed looked like the pyverbs tests all passed and then
there were leaked QP/PD/CQ. I also saw those. After fixing the QP
reference count bug (not in MW) I did not see any errors from the
pyverbs tests of MW. Or any other errors for that matter. What was the
other problem? Was that the memory barrier one (also not in MW)?
Mostly I want to know if you currently see any errors in the kernel
related to MW. The test case bug (in test_qpex.py) is a separate issue
The current test cases in rdma-core just confirm a regression in RXE.
Zhu Yanjun
Which test cases are you referring to? Currently all test cases either
pass or are skipped because they are not supported with one single
exception. That test in test_qpex.py is *not* a regression. It used to
skip until I added support for the extended MW bind operation to the
user code today. It now fails because the test is actually wrong. It
didn't set the access flags for the MR to support bind MW so the driver
fails the WR with a bind MW error which is the correct behavior. The
traditional QP WR API (ibv_post_send) exercises the same exact
functionality of the driver and they all set the MR access correctly and
pass. There are *no* actual errors being reported by the rdma-core tests.
Bob
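
The failure mode described above (binding a MW to an MR that was registered without bind access) can be illustrated with a small model. This is a sketch only: the `model_*` names are hypothetical and this is not the rxe or rdma-core code. `MODEL_ACCESS_MW_BIND` stands in for the `IBV_ACCESS_MW_BIND` flag that libibverbs passes at MR registration, and the status values stand in for work completion statuses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical access flags, modeled on the libibverbs access bits. */
enum model_access {
    MODEL_ACCESS_LOCAL_WRITE = 1 << 0,
    MODEL_ACCESS_MW_BIND     = 1 << 4,
};

/* Hypothetical completion statuses for the bind work request. */
enum model_wc_status {
    MODEL_WC_SUCCESS,
    MODEL_WC_MW_BIND_ERR,
};

struct model_mr {
    unsigned int access;   /* flags given when the MR was registered */
};

/* A bind succeeds only if the MR was registered with bind access. */
static enum model_wc_status model_bind_mw(const struct model_mr *mr)
{
    assert(mr != NULL);
    if (!(mr->access & MODEL_ACCESS_MW_BIND))
        return MODEL_WC_MW_BIND_ERR;
    return MODEL_WC_SUCCESS;
}

/* Register a model MR with the given flags and attempt one bind. */
static enum model_wc_status model_bind_with_access(unsigned int access)
{
    struct model_mr mr = { .access = access };

    return model_bind_mw(&mr);
}
```

A model MR registered with only `MODEL_ACCESS_LOCAL_WRITE` fails the bind, which matches the behavior reported for the test in test_qpex.py; adding `MODEL_ACCESS_MW_BIND` lets it succeed.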
that is not a rxe bug at all.
Bob
Could you send the MW test cases to rdma-core, so that we can
verify these MW patches with them?
In previous mails, you mentioned these MW test cases.
Thanks a lot.
Zhu Yanjun
Thanks