Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/4/2022 3:51 AM, lizhijian@xxxxxxxxxxx wrote:


On 31/12/2021 10:32, Tom Talpey wrote:

On 12/30/2021 8:37 PM, lizhijian@xxxxxxxxxxx wrote
+static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
+{
+    int            err;
+    int            bytes;
+    u8            *va;
+    struct rxe_map        **map;
+    struct rxe_phys_buf    *buf;
+    int            m;
+    int            i;
+    size_t            offset;
+
+    if (length == 0)
+        return 0;

The length is only relevant when the flush type is "Memory Region
Range".

When the flush type is "Memory Region", the entire region must be
flushed successfully before completing the operation.

Yes, currently, the length has been expanded to the MR's length in such case.

I'll dig deeper, but this is not immediately clear in this context.
A zero-length operation is however valid, even though it's a no-op,

If it's no-op, what shall we do in this case.

It's a no-op only in the case that the flush has the MRR (range)
flag set. When the Region flag is set, the length is ignored. As
you point out, the length is determined in process_flush().

I think the confusion arises because the routine is being used
for both cases, and the zero-length case is only valid for one.
To me, it would make more sense to make this test before calling
it. Something like:

+static enum resp_states process_flush(struct rxe_qp *qp,
+				       struct rxe_pkt_info *pkt)
+{
+	u64 length = 0, start = qp->resp.va;
+	u32 sel = feth_sel(pkt);
+	u32 plt = feth_plt(pkt);
+	struct rxe_mr *mr = qp->resp.mr;
+
+	if (sel == IB_EXT_SEL_MR_RANGE)
+		length = qp->resp.length;
+	else if (sel == IB_EXT_SEL_MR_WHOLE)
	{
+		length = mr->cur_map_set->length;
		WARN_ON(length == 0);
	}

	if (length > 0) {
		// there may be optimizations possible here...
		if (plt & IB_EXT_PLT_PERSIST)
			(flush to persistence)
		if (plt & IB_EXT_PLT_GLB_VIS)
			(flush to global visibility)
	}

+	/* set RDMA READ response of zero */
+	qp->resp.resid = 0;
+
+	return RESPST_READ_REPLY;
+}

Tom.



the application may use it to ensure certain ordering constraints.
Will comment later if I have a specific concern.

kindly welcome :)





+
+    if (mr->type == IB_MR_TYPE_DMA) {
+        arch_wb_cache_pmem((void *)iova, length);
+        return 0;
+    }

Are dmamr's supported for remote access? I thought that was
prevented on first principles now. I might suggest not allowing
them to be flushed in any event. There is no length restriction,
and it's a VERY costly operation. At a minimum, protect this
closely.
Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
Thanks for the suggestion.

Well, be careful when following semantics from local behaviors. When you
are processing a command from the wire, extreme caution is needed.
ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!

Sorry for shouting. :)

Never mind :)






+
+static enum resp_states process_flush(struct rxe_qp *qp,
+                       struct rxe_pkt_info *pkt)
+{
+    u64 length = 0, start = qp->resp.va;
+    u32 sel = feth_sel(pkt);
+    u32 plt = feth_plt(pkt);
+    struct rxe_mr *mr = qp->resp.mr;
+
+    if (sel == IB_EXT_SEL_MR_RANGE)
+        length = qp->resp.length;
+    else if (sel == IB_EXT_SEL_MR_WHOLE)
+        length = mr->cur_map_set->length;

I'm going to have to think about these

Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.

I'll review again when you decide.
+
+    if (plt == IB_EXT_PLT_PERSIST) {
+        nvdimm_flush_iova(mr, start, length);
+        wmb(); // clwb follows by a sfence
+    } else if (plt == IB_EXT_PLT_GLB_VIS)
+        wmb(); // sfence is enough

The persistence and global visibility bits are not mutually
exclusive,
My bad, it ever appeared in my mind. o(╯□╰)o




and in fact persistence does not imply global
visibility in some platforms.
If so, and per the SPEC, why not
      if (plt & IB_EXT_PLT_PERSIST)
         do_somethingA();
      if (plt & IB_EXT_PLT_GLB_VIS)
         do_somethingB();

At the simplest, yes. But there are many subtle other possibilities.

The global visibility is oriented toward supporting distributed
shared memory workloads, or publish/subscribe on high scale systems.
It's mainly about ensuring that all devices and CPUs see the data,
something ordinary RDMA Write does not guarantee. This often means
flushing write pipelines, possibly followed by invalidating caches.

The persistence, of course, is about ensuring the data is stored. This
is entirely different from making it visible.

In fact, you often want to optimize the two scenarios together, since
these operations are all about optimizing latency. The choice is going
to depend greatly on the behavior of the platform itself. For example,
certain caches provide persistence when batteries are present. So,
providing persistence and providing visibility are one and the same.
No need to do that twice.

On the other hand, some systems may provide persistence only after a
significant cost, such as writing into flash from a DRAM buffer, or
when an Optane DIMM becomes overloaded from continuous writes and
begins to throttle them. The flush may need some rather intricate
processing, to avoid deadlock.

Your code does not appear to envision these, so the simple way might
be best for now. But you should consider other situations.

Thanks a lot for your detailed explanation.




Second, the "clwb" and "sfence" comments are completely
Intel-specific.
good catch.


What processing will be done on other
processor platforms???

I didn't dig other ARCH yet but INTEL.
In this function, i think i just need to call the higher level wrapper, like wmb() and
arch_wb_cache_pmem are enough, right ?

Well, higher level wrappers may signal errors, for example if they're
not available or unreliable, and you will need to handle them. Also,
they may block. Is that ok in this context?

Good question.

My bad, i forgot to check to return value of nvdimm_flush_iova() previously.

But if you mean we should guarantee arch_wb_cache_pmem() and wmb(), i have not
idea yet.

arch_wb_cache_pmem() and wmb(), What i'm currently using are just to hide
the assembly instructions on different architectures. And they are void return.

I wonder if we can always assume they work always When the code reaches them.
Since all current local flushing to pmem do something like that AFAIK(Feel free
to correct me if i'm wrong)

Thanks
Zhijian


You need to get this right on all platforms which will run this code.
It is not acceptable to guess at whether the data is in the required
state before completing the RDMA_FLUSH. If this is not guaranteed,
the operation must raise the required error. To do anything else will
violate the protocol contract, and the remote application may fail.



Tom.



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux