On Tue, Jun 8, 2021 at 10:01 AM Pearson, Robert B <rpearsonhpe@xxxxxxxxx> wrote:
>
>
> On 6/7/2021 8:39 PM, Zhu Yanjun wrote:
> > On Tue, Jun 8, 2021 at 12:14 AM Pearson, Robert B <rpearsonhpe@xxxxxxxxx> wrote:
> >>
> >> On 6/7/2021 6:12 AM, Zhu Yanjun wrote:
> >>> On Mon, Jun 7, 2021 at 7:03 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>> On Mon, Jun 07, 2021 at 04:16:37PM +0800, Zhu Yanjun wrote:
> >>>>> On Sat, Jun 5, 2021 at 7:07 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
> >>>>>> Currently the rdma_rxe driver attempts to protect atomic responder
> >>>>>> resources by taking a reference to the qp which is only freed when the
> >>>>>> resource is recycled for a new read or atomic operation. This means that
> >>>>>> in normal circumstances there is almost always an extra qp reference
> >>>>>> once an atomic operation has been executed which prevents cleaning up
> >>>>>> the qp and associated pd and cqs when the qp is destroyed.
> >>>>>>
> >>>>>> This patch removes the call to rxe_add_ref() in send_atomic_ack() and the
> >>>>>> call to rxe_drop_ref() in free_rd_atomic_resource(). If the qp is
> >>>>> Not sure if it is a good way to fix this problem by removing the call
> >>>>> to rxe_add_ref.
> >>>>> Because taking a reference to the qp is to protect atomic responder resources.
> >>>>>
> >>>>> Removing rxe_add_ref is to decrease the protection of the atomic
> >>>>> responder resources.
> >>>> All those rxe_add_ref/rxe_drop_ref in RXE are horrid. It will be good to delete them all.
> >>>>
> >>> I made tests with this commit. After this commit is applied, this
> >>> problem disappeared.
> >> You were testing MW when you saw this bug. Does that mean that now MW is
> >> working for you?
> > Your MW patches are huge. After these patches are applied, I found 2
> > problems in my test environment.
>
> The trace you showed looked like the pyverbs tests all passed and then
> there were leaked QP/PD/CQ. I also saw those. After fixing the QP
> reference count bug (not in MW) I did not see any errors from the
> pyverbs tests of MW. Or any other errors for that matter. What was the
> other problem? Was that the memory barrier one (also not in MW)?
>
> Mostly I want to know if you currently see any errors in the kernel
> related to MW. The test case bug (in test_qpex.py) is a separate issue

The current test cases in rdma-core just confirm a regression in RXE.

Zhu Yanjun

> that is not a rxe bug at all.
>
> Bob
>
> > So IMO, can you send the test cases about MW to rdma-core? So we can
> > verify these MW patches with them.
> >
> > In previous mails, you mentioned these MW test cases.
> >
> > Thanks a lot.
> > Zhu Yanjun
> >
> >>> Zhu Yanjun
> >>>
> >>>> Thanks