On Tue, May 25, 2021 at 01:09:01PM -0500, Pearson, Robert B wrote:
>
> On 5/25/2021 10:23 AM, Pearson, Robert B wrote:
> > On further reflection I realize I did not correctly understand the
> > user/kernel API issue. I was assuming that the user application
> > should continue to run but that we could require re-compiling
> > rdma-core. If we require that old rdma-core binaries run on newer
> > kernels then the 40 bytes is an issue. I always recompiled rdma-core
> > and didn't test running with old binaries. Fortunately there is an
> > easy fix. The flags field in the earlier rxe mw version had one bit
> > in it but the new version dropped that and I never went back and
> > removed the field. Dropping the flags field doesn't break anything
> > but lets the mw struct fit in the wr union without extending it.
> >
> > I will fix, retest and resubmit.
> >
> > Bob
> >
> > From: Zhu Yanjun <zyjzyj2000@xxxxxxxxx>
> > Sent: Tuesday, May 25, 2021 10:00 AM
> > To: Pearson, Robert B <robert.pearson2@xxxxxxx>
> > Cc: Pearson, Robert B <rpearsonhpe@xxxxxxxxx>; Jason Gunthorpe <jgg@xxxxxxxxxx>; RDMA mailing list <linux-rdma@xxxxxxxxxxxxxxx>
> > Subject: Re: [PATCH for-next v7 00/10] RDMA/rxe: Implement memory windows
> >
> > On Tue, May 25, 2021 at 1:27 PM Pearson, Robert B <robert.pearson2@xxxxxxx> wrote:
> > > There's nothing to change. There is no problem. Just get the
> > > headers sync'ed.
> > > If that doesn't fix your issues your tree has gotten corrupted
> > > somehow. But, I don't think that is the issue. I saw the same type
> > > of errors you reported when rdma-core is built with the old header
> > > file. That definitely will cause problems. The size of the send
> > > queue WQEs changed because new fields were added. Then user space
> > > and the kernel immediately get out of sync with each other.
> > >
> > > Good luck,
> > About rdma-core, the root cause is clear. I am fine with this patch
> > series.
> > Thanks, Bob.
> >
> > Zhu Yanjun
> >
> Well. Interesting. Having pulled latest rdma-core again and fixed the wr.mw
> size issue I now see a bunch of CQ and QP errors which have nothing to do
> with the memory windows patches. It looks more like a memory ordering
> problem around the queues. Is this possibly related to the recent relaxed
> ordering changes?

They haven't been merged and wouldn't affect a SW driver like rxe.

> The one py test failure I have chased down is in the resize cq
> test. The first time it runs after building a new module I can print
> out the new cqe and the current queue count and see the expected 1
> which is less than 6 but the code takes the wrong branch and does
> not report an error. Rerunning the test I get the expected behavior
> and the test passes. This will take a bit of effort.

Bisect the kernel?

Jason
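
For anyone following the wr.mw size discussion quoted above, here is a
minimal sketch of why one leftover 32-bit field can change the size of a
shared union, and how a compile-time check could catch it. All struct,
union, and field names below are hypothetical stand-ins chosen for
illustration; they are not the actual rxe uapi definitions, and the byte
counts are only assumed to resemble the 40-byte case mentioned in the
thread.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Hypothetical stand-in for the largest member already in the union. */
struct demo_atomic_wr {
	_Alignas(8) uint64_t remote_addr;
	uint64_t compare_add;
	uint64_t swap;
	uint32_t rkey;
	uint32_t reserved;
};				/* 32 bytes */

/* Hypothetical mw work request that still carries a leftover flags field. */
struct demo_mw_wr_with_flags {
	_Alignas(8) uint64_t addr;
	uint64_t length;
	uint32_t mr_lkey;
	uint32_t mw_rkey;
	uint32_t rkey;
	uint32_t access;
	uint32_t flags;		/* 36 bytes of fields, padded to 40 */
};

/* Same request with the unused flags field dropped. */
struct demo_mw_wr {
	_Alignas(8) uint64_t addr;
	uint64_t length;
	uint32_t mr_lkey;
	uint32_t mw_rkey;
	uint32_t rkey;
	uint32_t access;
};				/* 32 bytes */

/* The union as old user-space binaries were (hypothetically) built. */
union demo_wr_old {
	struct demo_atomic_wr atomic;
};

/* Adding the larger struct grows the union, so every WQE grows too. */
union demo_wr_grown {
	struct demo_atomic_wr atomic;
	struct demo_mw_wr_with_flags mw;
};

/* Adding the trimmed struct leaves the union size unchanged. */
union demo_wr_fixed {
	struct demo_atomic_wr atomic;
	struct demo_mw_wr mw;
};

/* Compile-time guards: the build fails if the layout assumption breaks. */
static_assert(sizeof(union demo_wr_grown) > sizeof(union demo_wr_old),
	      "leftover flags field is expected to grow the union");
static_assert(sizeof(union demo_wr_fixed) == sizeof(union demo_wr_old),
	      "new member must not change the union size");
```

A check along these lines, placed next to the shared header, would fail
the build as soon as a new member extended the union, rather than
leaving old binaries and a new kernel to disagree at run time.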