RE: [PATCH rdma-core] irdma: Restore full memory barrier for doorbell optimization

"Nikolova, Tatyana E" <tatyana.e.nikolova@xxxxxxxxx> · Thu, 2 Sep 2021 16:27:37 +0000

> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, August 19, 2021 5:44 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@xxxxxxxxx>
> Cc: dledford@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
> 
> On Thu, Aug 19, 2021 at 10:01:50PM +0000, Nikolova, Tatyana E wrote:
> >
> >
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Wednesday, August 18, 2021 11:50 AM
> > > To: Nikolova, Tatyana E <tatyana.e.nikolova@xxxxxxxxx>
> > > Cc: dledford@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier
> > > for doorbell optimization
> > >
> > > On Fri, Aug 13, 2021 at 05:25:49PM -0500, Tatyana Nikolova wrote:
> > > > >> 1.	Software writing the valid bit in the WQE.
> > > > >> 2.	Software reading shadow memory (hw_tail) value.
> > > >
> > > > > You are missing an ordered atomic on this read it looks like
> > > >
> > > > Hi Jason,
> > > >
> > > > Why do you think we need atomic ops in this case? We aren't trying
> > > > to protect from multiple threads but CPU re-ordering of a write
> > > > and a read.
> > >
> > > Which is what the atomics will do.
> > >
> > > Barriers are only appropriate when you can't add atomic markers to
> > > the actual data that needs ordering.
> >
> > Hi Jason,
> >
> > We aren't sure what you mean by atomic markers. We ran a few
> > experiments with atomics, but none of the barriers we tried
> > smp_mb__{before,after}_atomic(), smp_load_acquire() and
> > smp_store_release() translates to a full memory barrier on X86.
> 
> Huh? Those are kernel primitives, this is a userspace patch.
> 
> Userspace follows the C11 atomics memory model.
> 
> So I'd expect
> 
>   atomic_store_explicit(tail, memory_order_release)
>   atomic_load_explicit(tail, memory_order_acquire)
> 
> To be the atomics you need. This will ensure that the read/writes to valid
> before the atomics are sequenced correctly, eg no CPU thread can observe
> an updated tail without also observing the set valid.
> 

Hi Jason,

We tried these atomic ops as shown bellow, but they don't fix the issue.

atomic_store_explicit(hdr, memory_order_release) atomic_load_explicit(tail, memory_order_acquire)

In assembly they look like this:

//set_64bit_val(wqe, 24, hdr);
atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr, memory_order_release);
                     2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:747

/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:123
        temp = atomic_load_explicit((_Atomic(__u64) *)qp->shadow_area, memory_order_acquire);
    1c32:       15 00 00 28 84          adc    $0x84280000,%eax

However, the following works:
 atomic_store_explicit(hdr, memory_order_seq_cst)

//set_64bit_val(wqe, 24, hdr);
 atomic_store_explicit((_Atomic(uint64_t) *)(wqe + (24 >> 3)), hdr,  memory_order_seq_cst);
    2130:       49 89 5f 18             mov    %rbx,0x18(%r15)
    2134:       0f ae f0                mfence
/root/CVL-3.0-V26.4C00390/rdma-core-27.0/build/../providers/irdma/uk.c:748

atomic_load_explicit(tail, memory_order_seq_cst) - same assembly as with memory_order_acquire

Thank you,
Tatyana