> -----Original Message----- > From: Jason Gunthorpe <jgg@xxxxxxxxxx> > Sent: Thursday, August 19, 2021 5:44 PM > To: Nikolova, Tatyana E <tatyana.e.nikolova@xxxxxxxxx> > Cc: dledford@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for > doorbell optimization > > On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote: > > > > > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx> > > > Sent: Wednesday, August 18, 2021 11:50 AM > > > To: Nikolova, Tatyana E <tatyana.e.nikolova@xxxxxxxxx> > > > Cc: dledford@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx > > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier > > > for doorbell optimization > > > > > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote: > > > > >> 1. Software writing the valid bit in the WQE. > > > > >> 2. Software reading shadow memory (hw_tail) value. > > > > > > > > > You are missing an ordered atomic on this read it looks like > > > > > > > > Hi Jason, > > > > > > > > Why do you think we need atomic ops in this case? We aren't trying > > > > to protect from multiple threads but CPU re-ordering of a write > > > > and a read. > > > > > > Which is what the atomics will do. > > > > > > Barriers are only appropriate when you can't add atomic markers to > > > the actual data that needs ordering. > > > > Hi Jason, > > > > We aren't sure what you mean by atomic markers. We ran a few > > experiments with atomics, but none of the barriers we tried > > smp_mb__{before,after}_atomic(), smp_load_acquire() and > > smp_store_release() translates to a full memory barrier on X86. > > Huh? Those are kernel primitives, this is a userspace patch. > > Userspace follows the C11 atomics memory model. > > So I'd expect > > atomic_store_explicit(tail, memory_order_release) > atomic_load_explicit(tail, memory_order_acquire) > > To be the atomics you need. This will ensure that the read/writes to valid > before the atomics are sequenced correctly, eg no CPU thread can observe > an updated tail without also observing the set valid. > Hi Jason, We tried these atomic ops as shown bellow, but they don't fix the issue. atomic_store_explicit(hdr, memory_order_release) atomic_load_explicit(tail, memory_order_acquire) In assembly they look like this: //set_64bit_val(wqe, 24, hdr); atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release); 2130: 49 89 5f 18 mov %rbx,0x18(%r15) /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747 /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123 temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire); 1c32: 15 00 00 28 84 adc $0x84280000,%eax However, the following works: atomic_store_explicit(hdr, memory_order_seq_cst) //set_64bit_val(wqe, 24, hdr); atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_seq_cst); 2130: 49 89 5f 18 mov %rbx,0x18(%r15) 2134: 0f ae f0 mfence /root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748 atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with memory_order_acquire Thank you, Tatyana