RE: [PATCH rdma-core] irdma: Restore full memory barrier for doorbell optimization

"Nikolova, Tatyana E" <tatyana.e.nikolova@xxxxxxxxx> · Mon, 9 Aug 2021 20:07:32 +0000

> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, August 5, 2021 8:29 PM
> To: Nikolova, Tatyana E <tatyana.e.nikolova@xxxxxxxxx>
> Cc: dledford@xxxxxxxxxx; leon@xxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH rdma-core] irdma: Restore full memory barrier for
> doorbell optimization
> 
> On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> > During the irdma library upstream submission we agreed to replace
> > atomic_thread_fence(memory_order_seq_cst) in the irdma doorbell
> > optimization algorithm with udma_to_device_barrier().
> > However, further regression testing uncovered cases where in absence
> > of a full memory barrier, the algorithm incorrectly skips ringing the
> > doorbell.
> >
> > There has been a discussion about the necessity of a full memory
> > barrier for the doorbell optimization in the past:
> > https://lore.kernel.org/linux-rdma/20170301172920.GA11340@ssaleem-
> MOBL
> > 4.amr.corp.intel.com/
> >
> > The algorithm decides whether to ring the doorbell based on input from
> > the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> > then the algorithm doesn't ring the doorbell, because it assumes that
> > the HW is still actively processing WQEs.
> >
> > The shadow area indicates the last WQE processed by the HW and it is
> > okay if the shadow area value isn't the most current. However there
> > can't be a window of time between reading the shadow area and setting
> > the valid bit for the last WQE posted, because the HW may go idle and
> > the algorithm won't detect this.
> >
> > The following two cases illustrate this issue and are identical,
> > except for ops ordering. The first case is an example of how the wrong
> > order results in not ringing the doorbell when the HW has gone idle.
> 
> I can't really understand this explanation. since this seemes to be about a
> concurrency problem can you please explain it using a normal ladder
> diagram? (eg 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing against
> process tick") to pick an example at random)
>
Hi Jason,

The doorbell optimization algorithm involves the following operations:

1.	Software writing the valid bit in the WQE.
2.	Software reading shadow memory (hw_tail) value.
3.	Software deciding whether or not to ring the doorbell.
4.	Hardware reading that WQE.

Without the MFENCE after step 1, the processor is free to move instructions around - it can move the write of the valid bit later (after software reads the hw_tail value).  

MFENCE ensures the order of 1 and 2. 

MFENCE (Vol. 1 11-12)
"The MFENCE instruction establishes a memory fence for both loads and stores. The processor ensures that no load or store after MFENCE will become globally visible until all loads and stores before MFENCE are globally visible"

If the order of 1 and 2 is not maintained, the algorithm can fail because the hardware may not see the valid bit and the doorbell optimization algorithm does not ring the doorbell. This causes hardware to go idle even though there is work to do.

The failure scenario without the MFENCE is shown in a ladder diagram:

SOFTWARE                        CPU                                  DMA DEVICE

Writes the valid bit
SFENCE
Reads the hw_tail value

                                              <Reorders instructions>
                                              Reads hw_tail value 

Decides not to ring the doorbell.

                                                                                        Reads WQE - <valid bit is not set and hardware goes idle>
                                              Writes valid bit
			   SFENCE

The order of 1 and 2 is not maintained which causes hardware to go idle even though there is work to do.

Our doorbell optimization algorithm has always required a full memory barrier because it prevents reordering of stores and loads.
There used to be an MFENCE in the i40iw doorbell algorithm. In a previous discussion, https://lore.kernel.org/linux-rdma/20170306194052.GB31672@xxxxxxxxxxxxxxxxxxxx/, you suggested the use of atomic_thread_fence(memory_order_seq_cst), which is a C11 barrier, and this did work for us.

Thank you,
Tatyana

> > diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c index
> > c7053c52..d63996db 100644
> > +++ b/providers/irdma/uk.c
> > @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk
> *qp)
> >  	__u32 sw_sq_head;
> >
> >  	/* valid bit is written and loads completed before reading shadow */
> > -	udma_to_device_barrier();
> > +	atomic_thread_fence(memory_order_seq_cst);
> 
> Because it certainly looks wrong to replace a DMA barrier with something
> that is not a DMA barrier.
> 
> I'm guessing this problem is that the shadow memory is not locked and also
> not using using atomics to control concurrent access it? If so then the fix is to
> use atomics for the shadow memory and place the proper order
> requirement on the atomic itself.
> 
> Jason