On 7/24/22 23:00, lizhijian@xxxxxxxxxxx wrote: > > > On 22/07/2022 02:18, Bob Pearson wrote: >> On 7/20/22 05:50, Haris Iqbal wrote: >>> On Wed, Jul 20, 2022 at 12:22 PM Li Zhijian <lizhijian@xxxxxxxxxxx> wrote: >>>> Below 2 commits will be reverted: >>>> 8ff5f5d9d8cf ("RDMA/rxe: Prevent double freeing rxe_map_set()") >>>> 647bf13ce944 ("RDMA/rxe: Create duplicate mapping tables for FMRs") >>>> >>>> The community has a few bug reports which pointed this commit at last. >>>> Some proposals are raised up in the meantime but all of them have no >>>> follow-up operation. >>>> >>>> The previous commit led the map_set of FMR to be not avaliable any more if >>>> the MR is registered again after invalidating. Although the mentioned >>>> patch try to fix a potential race in building/accessing the same table >>>> for fast memory regions, it broke rnbd etc ULPs. Since the latter could >>>> be worse, revert this patch. >>>> >>>> With previous commit, it's observed that a same MR in rnbd server will >>>> trigger below code path: >>> Looks Good. I tested the patch against rdma for-next and it solves the >>> problem mentioned in the commit. >>> One small nitpick. It should be rtrs, and not rnbd in the commit message. >>> >>> Feel free to add my, >>> >>> Tested-by: Md Haris Iqbal <haris.iqbal@xxxxxxxxx> >>> >> Li, >> >> It has been a while since this was added. If I recall there was a problem in rnfs >> that this was supposed to fix. It was also supposed to allow overlap of using the >> previous mappings and the driver creating new ones. But it seems that most fmr >> based ulps don't require it, maybe all. Before we do this we should make sure that >> blktests, srp, lustre, rnfs, etc all work. Have these been tested? > > blktests(nvme over RXE and srp) works fine after this reverting. > lustre and rnfs have not tested because I have no lustre and rnfs local environment currently. > > I do wish to know what's the original problem you fixed in 647bf13ce944 ("RDMA/rxe: Create duplicate mapping tables for FMRs") > Could we have other approaches for it such as add locks to prevent the potential *race*. > > I agreed on the view[1]("you need to go back to one map") from Jason > > [1]: https://lore.kernel.org/all/20220527124240.GB2960187@xxxxxxxx/ > > Thanks > ZHijian > >> >> Bob Li, I agree. You can add Reviewed-by: Bob Pearson <rpearsonhpe@xxxxxxxxx> Our Lustre testing is still on older versions of the driver with one map and it works fine. I am not able to reproduce the rnfs results from last year so I just don't know. I still have failures in srp blktests but I doubt it is related to this issue. Tests 002 and 011 seem to hang and I have never been able to figure out why. I suspect that mr->state can be racy. It is a state machine that can trigger changes from client code or tasklet code on the request side or the response side. I don't have solid evidence that this has happened but it seems to me like a good idea to guard the state machine with a spin lock. I will post a patch that does that. Bob