Re: [RFC PATCH 2/2] RDMA/rxe: Add RDMA Atomic Write operation

On 1/5/2022 4:24 AM, yangx.jy@xxxxxxxxxxx wrote:
On 2021/12/31 23:09, Tom Talpey wrote:
On 12/31/2021 3:29 AM, yangx.jy@xxxxxxxxxxx wrote:
On 2021/12/31 5:39, Tom Talpey wrote:

On 12/30/2021 7:14 AM, Xiao Yang wrote:
This patch implements RDMA Atomic Write operation for RC service.

Signed-off-by: Xiao Yang <yangx.jy@xxxxxxxxxxx>
---
   drivers/infiniband/sw/rxe/rxe_comp.c   |  4 +++
   drivers/infiniband/sw/rxe/rxe_opcode.c | 18 ++++++++++
   drivers/infiniband/sw/rxe/rxe_opcode.h |  3 ++
   drivers/infiniband/sw/rxe/rxe_qp.c     |  3 +-
   drivers/infiniband/sw/rxe/rxe_req.c    | 10 ++++--
   drivers/infiniband/sw/rxe/rxe_resp.c   | 49 +++++++++++++++++++++-----
   include/rdma/ib_pack.h                 |  2 ++
   include/rdma/ib_verbs.h                |  2 ++
   include/uapi/rdma/ib_user_verbs.h      |  2 ++
   9 files changed, 81 insertions(+), 12 deletions(-)


<snip>

+/* Guarantee atomicity of atomic write operations at the machine level. */
+static DEFINE_SPINLOCK(atomic_write_ops_lock);
+
+static enum resp_states process_atomic_write(struct rxe_qp *qp,
+                         struct rxe_pkt_info *pkt)
+{
+    u64 *src, *dst;
+    struct rxe_mr *mr = qp->resp.mr;
+
+    src = payload_addr(pkt);
+
+    dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, sizeof(u64));
+    if (!dst || (uintptr_t)dst & 7)
+        return RESPST_ERR_MISALIGNED_ATOMIC;
+
+    spin_lock_bh(&atomic_write_ops_lock);
+    *dst = *src;
+    spin_unlock_bh(&atomic_write_ops_lock);

Spinlock protection is insufficient! Using a spinlock protects only
the atomicity of the store operation with respect to another remote
atomic_write operation. But the semantics of RDMA_ATOMIC_WRITE go much
further. The operation requires a fully atomic bus transaction, across
the memory complex and with respect to CPU, PCI, and other sources.
And this guarantee needs to apply to all architectures, including ones
with noncoherent caches (software consistency).

Because RXE is a software provider, I believe the most natural approach
here is to use an atomic64_set(dst, *src). But I'm not certain this
is supported on all the target architectures, therefore it may require
some additional failure checks, such as stricter alignment than testing
(dst & 7), or returning a failure if atomic64 operations are not
available. I especially think the ARM and PPC platforms will need
an expert review.
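
As a rough sketch only (not the patch under review; it assumes dst maps
ordinary cacheable kernel memory, and uses the stock IS_ALIGNED() and
atomic64_set() helpers), the store portion might become:
----------------------------------------------------------------
        /* require natural 64-bit alignment, then issue a single
         * untorn 64-bit store instead of taking a spinlock */
        if (!dst || !IS_ALIGNED((uintptr_t)dst, sizeof(u64)))
                return RESPST_ERR_MISALIGNED_ATOMIC;

        atomic64_set((atomic64_t *)dst, *src);
----------------------------------------------------------------
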
Hi Tom,

Thanks a lot for the detailed suggestion.

I will try to use atomic64_set() and add additional failure checks.
By the way, process_atomic() uses the same method (spinlock + assignment
expression), so do you think we also need to update it to use atomic64_set()?

It is *critical* for you to understand that ATOMIC_WRITE has a
much stronger semantic than ordinary RDMA atomics.

The ordinary atomics are only atomic from the perspective of a single
connection and a single adapter. And the result is not placed in memory
atomically; only its execution in the adapter is processed that way.
And the final result is only consistently visible to the application
after a completion event is delivered. This rather huge compromise is
because when atomics were first designed, there were no PCI primitives
which supported atomics.

An RDMA atomic_write operation is atomic all the way to memory. It
must be implemented with a platform operation which is similarly
atomic. If implemented by a PCIe adapter, it would use the newer
PCIe atomics to perform a locked read-modify-write. If implemented
in software as you are doing, it must use a similar local r-m-w
instruction, or the platform equivalent if necessary.
Hi Tom,

It's OK to replace "spinlock + assignment expression" with atomic64_set(), but atomic64_set() only ensures the atomicity of the write operation as well (in other words, it cannot ensure that the data is written into memory atomically).
Let's see the definition of atomic64_set() on x86 and arm64:
----------------------------------------------------------------
#define __WRITE_ONCE(x, val)                                            \
do {                                                                    \
         *(volatile typeof(x) *)&(x) = (val);                            \
} while (0)

x86:
static inline void arch_atomic64_set(atomic64_t *v, s64 i)
{
         __WRITE_ONCE(v->counter, i);
}

arm64:
#define arch_atomic64_set                       arch_atomic_set
#define arch_atomic_set(v, i) __WRITE_ONCE(((v)->counter), (i))
----------------------------------------------------------------
We may need both atomic64_set() and wmb() here. Do you think so?
----------------------------------------------------------------
atomic64_set(dst, *src);
wmb();
----------------------------------------------------------------

No, I don't think an explicit wmb() is required *after* the store. The
only requirement is that the ATOMIC_WRITE 64-bit payload is not torn.
It's not necessary to order it with respect to subsequent writes from
other QPs, or even the same one.

In addition, I think we don't need to check whether atomic64_set() is available, because all arches support it: many arches have their own atomic64_set() and the others use the generic one, for example:
----------------------------------------------------------------
x86:
static inline void arch_atomic64_set(atomic64_t *v, s64 i)
{
         __WRITE_ONCE(v->counter, i);
}

generic:
void generic_atomic64_set(atomic64_t *v, s64 i)
{
         unsigned long flags;
         raw_spinlock_t *lock = lock_addr(v);

         raw_spin_lock_irqsave(lock, flags);
         v->counter = i;
         raw_spin_unlock_irqrestore(lock, flags);
}
EXPORT_SYMBOL(generic_atomic64_set);
----------------------------------------------------------------

We're discussing this in the other fork of the thread. I think Jason's
suggestion of using smp_store_mb() has good merit.
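
For reference, a minimal sketch of what that could look like in the
responder path (reusing the dst/src names from the patch; whether
smp_store_mb() guarantees an untorn 64-bit store on every architecture
is exactly the open question):
----------------------------------------------------------------
        /* WRITE_ONCE() of the 64-bit payload followed by a full
         * barrier, per the smp_store_mb() suggestion */
        smp_store_mb(*dst, *src);
----------------------------------------------------------------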

Finally, could you tell me how to add a stricter alignment check than testing (dst & 7)?

Alignment is only part of the issue. The instruction set, type of
mapping, and the width of the bus are also important to determine if
a torn write might occur.

Off the top of my head, an uncached mapping would be a problem on an
architecture which did not support 64-bit stores. A cache will merge
32-bit writes, which won't happen if it's disabled. I guess it could
be argued this is uninteresting on a modern platform, but...

A second concern might be MMIO space, or a similar dedicated memory block
such as GPU memory. It's completely a platform question whether these can provide
an untorn 64-bit write semantic.

Keep in mind that ATOMIC_WRITE allows an application to poll a 64-bit
location to receive a result, without reaping any completion first. The
value must appear after all prior operations have completed. And the
entire 64-bit value must be intact. That's it.
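
A hypothetical kernel-style illustration of that polling pattern (the
done_flag location and DONE_VALUE sentinel are made-up names, not part
of the patch):
----------------------------------------------------------------
        /* spin on the 64-bit location the peer targets with
         * ATOMIC_WRITE, without reaping a completion first;
         * done_flag is a u64 * into the registered buffer */
        while (READ_ONCE(*done_flag) != DONE_VALUE)
                cpu_relax();

        /* the full, untorn 64-bit value is now visible, and all of
         * the peer's prior operations have completed */
----------------------------------------------------------------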

Tom.


