On 31/12/2021 06:18, Tom Talpey wrote: > On 12/28/2021 3:07 AM, Li Zhijian wrote: >> In contrast to other opcodes, after a series of sanity checking, FLUSH >> opcode will do a Placement Type checking before it really do the FLUSH >> operation. Responder will also reply NAK "Remote Access Error" if it >> found a placement type violation. >> >> We will persist data via arch_wb_cache_pmem(), which could be >> architecture specific. >> >> After the execution, responder would reply a responded successfully by >> RDMA READ response of zero size. >> >> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxxxxx> >> --- >> drivers/infiniband/sw/rxe/rxe_hdr.h | 28 ++++++ >> drivers/infiniband/sw/rxe/rxe_loc.h | 2 + >> drivers/infiniband/sw/rxe/rxe_mr.c | 4 +- >> drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++- >> include/uapi/rdma/ib_user_verbs.h | 10 ++ >> 5 files changed, 169 insertions(+), 6 deletions(-) >> > > <snip> > >> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length) >> +{ >> + int err; >> + int bytes; >> + u8 *va; >> + struct rxe_map **map; >> + struct rxe_phys_buf *buf; >> + int m; >> + int i; >> + size_t offset; >> + >> + if (length == 0) >> + return 0; > > The length is only relevant when the flush type is "Memory Region > Range". > > When the flush type is "Memory Region", the entire region must be > flushed successfully before completing the operation. Yes, currently, the length has been expanded to the MR's length in such case. > >> + >> + if (mr->type == IB_MR_TYPE_DMA) { >> + arch_wb_cache_pmem((void *)iova, length); >> + return 0; >> + } > > Are dmamr's supported for remote access? I thought that was > prevented on first principles now. I might suggest not allowing > them to be flushed in any event. There is no length restriction, > and it's a VERY costly operation. At a minimum, protect this > closely. Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy() Thanks for the suggestion. > >> + >> + WARN_ON_ONCE(!mr->cur_map_set); > > The WARN is rather pointless because the code will crash just > seven lines below. > >> + >> + err = mr_check_range(mr, iova, length); >> + if (err) { >> + err = -EFAULT; >> + goto err1; >> + } >> + >> + lookup_iova(mr, iova, &m, &i, &offset); >> + >> + map = mr->cur_map_set->map + m; >> + buf = map[0]->buf + i; >> + >> + while (length > 0) { >> + va = (u8 *)(uintptr_t)buf->addr + offset; >> + bytes = buf->size - offset; >> + >> + if (bytes > length) >> + bytes = length; >> + >> + arch_wb_cache_pmem(va, bytes); >> + >> + length -= bytes; >> + >> + offset = 0; >> + buf++; >> + i++; >> + >> + if (i == RXE_BUF_PER_MAP) { >> + i = 0; >> + map++; >> + buf = map[0]->buf; >> + } >> + } >> + >> + return 0; >> + >> +err1: >> + return err; >> +} >> + >> +static enum resp_states process_flush(struct rxe_qp *qp, >> + struct rxe_pkt_info *pkt) >> +{ >> + u64 length = 0, start = qp->resp.va; >> + u32 sel = feth_sel(pkt); >> + u32 plt = feth_plt(pkt); >> + struct rxe_mr *mr = qp->resp.mr; >> + >> + if (sel == IB_EXT_SEL_MR_RANGE) >> + length = qp->resp.length; >> + else if (sel == IB_EXT_SEL_MR_WHOLE) >> + length = mr->cur_map_set->length; > > I'm going to have to think about these Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well. >> + >> + if (plt == IB_EXT_PLT_PERSIST) { >> + nvdimm_flush_iova(mr, start, length); >> + wmb(); // clwb follows by a sfence >> + } else if (plt == IB_EXT_PLT_GLB_VIS) >> + wmb(); // sfence is enough > > The persistence and global visibility bits are not mutually > exclusive, My bad, it ever appeared in my mind. o(╯□╰)o > and in fact persistence does not imply global > visibility in some platforms. If so, and per the SPEC, why not if (plt & IB_EXT_PLT_PERSIST) do_somethingA(); if (plt & IB_EXT_PLT_GLB_VIS) do_somethingB(); > They must be tested and > processed individually. > > if (plt & IB_EXT_PLT_PERSIST) > ... > else if (plt & IB_EXT_PLT_GLB_VIS) > .. > > Second, the "clwb" and "sfence" comments are completely > Intel-specific. good catch. > What processing will be done on other > processor platforms??? I didn't dig other ARCH yet but INTEL. In this function, i think i just need to call the higher level wrapper, like wmb() and arch_wb_cache_pmem are enough, right ? Again, thank you. Thanks. > > Tom.