On Thu, Aug 05, 2021 at 11:11:34AM -0500, Tatyana Nikolova wrote:
> During the irdma library upstream submission we agreed to
> replace atomic_thread_fence(memory_order_seq_cst) in the irdma
> doorbell optimization algorithm with udma_to_device_barrier().
> However, further regression testing uncovered cases where in
> absence of a full memory barrier, the algorithm incorrectly
> skips ringing the doorbell.
>
> There has been a discussion about the necessity of a full
> memory barrier for the doorbell optimization in the past:
> https://lore.kernel.org/linux-rdma/20170301172920.GA11340@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> The algorithm decides whether to ring the doorbell based on input
> from the shadow memory (hw_tail). If the hw_tail is behind the sq_head,
> then the algorithm doesn't ring the doorbell, because it assumes that
> the HW is still actively processing WQEs.
>
> The shadow area indicates the last WQE processed by the HW and it is
> okay if the shadow area value isn't the most current. However there
> can't be a window of time between reading the shadow area and setting
> the valid bit for the last WQE posted, because the HW may go idle and
> the algorithm won't detect this.
>
> The following two cases illustrate this issue and are identical,
> except for ops ordering. The first case is an example of how
> the wrong order results in not ringing the doorbell when the
> HW has gone idle.

I can't really understand this explanation. Since this seems to be
about a concurrency problem, can you please explain it using a normal
ladder diagram?
(e.g. 1a3402d93c73 ("posix-cpu-timers: Fix rearm racing against process
tick"), to pick an example at random)

> diff --git a/providers/irdma/uk.c b/providers/irdma/uk.c
> index c7053c52..d63996db 100644
> +++ b/providers/irdma/uk.c
> @@ -118,7 +118,7 @@ void irdma_uk_qp_post_wr(struct irdma_qp_uk *qp)
>  	__u32 sw_sq_head;
>
>  	/* valid bit is written and loads completed before reading shadow */
> -	udma_to_device_barrier();
> +	atomic_thread_fence(memory_order_seq_cst);

Because it certainly looks wrong to replace a DMA barrier with
something that is not a DMA barrier.

I'm guessing the problem is that the shadow memory is not locked and
is also not using atomics to control concurrent access to it?

If so, then the fix is to use atomics for the shadow memory and place
the proper ordering requirement on the atomic itself.

Jason
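
For illustration only, a minimal sketch of what "atomics for the shadow
memory" could look like with C11 atomics. Everything here is a
hypothetical stand-in: struct sketch_qp, sketch_qp_init() and
sketch_post_wr() are not irdma code, the head/tail comparison is
simplified (no wraparound), and the C11 memory model only describes
ordering between CPU threads, so whether this is sufficient against a
device writing the shadow area is not something the sketch settles.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the queue state; not the irdma structures. */
struct sketch_qp {
	_Atomic uint32_t wqe_valid; /* valid marker of the last WQE posted */
	_Atomic uint32_t hw_tail;   /* shadow area: last WQE the HW processed */
	uint32_t sq_head;           /* software producer index, owner-only */
};

static void sketch_qp_init(struct sketch_qp *qp)
{
	atomic_init(&qp->wqe_valid, 0);
	atomic_init(&qp->hw_tail, 0);
	qp->sq_head = 0;
}

/* Post one WQE; return true when the doorbell should be rung. */
static bool sketch_post_wr(struct sketch_qp *qp)
{
	uint32_t prev_head = qp->sq_head++;
	uint32_t hw_tail;

	/*
	 * Publish the WQE, then read the shadow. Both accesses are seq_cst,
	 * so the store precedes the load in the single total order of
	 * seq_cst operations; the ordering lives on the atomics themselves.
	 */
	atomic_store_explicit(&qp->wqe_valid, qp->sq_head,
			      memory_order_seq_cst);
	hw_tail = atomic_load_explicit(&qp->hw_tail, memory_order_seq_cst);

	/*
	 * Simplified assumption: a tail behind the previous head means the
	 * HW is still busy with older WQEs, so skip the doorbell. Otherwise
	 * it may have gone idle and must be woken.
	 */
	return hw_tail >= prev_head;
}

/* Small walk-through of the three interesting states. */
static void sketch_demo(void)
{
	struct sketch_qp qp;

	sketch_qp_init(&qp);
	assert(sketch_post_wr(&qp));   /* nothing outstanding: ring */
	assert(!sketch_post_wr(&qp));  /* tail 0 behind prev head 1: skip */
	atomic_store(&qp.hw_tail, 2);  /* "HW" catches up */
	assert(sketch_post_wr(&qp));   /* may be idle again: ring */
}
```

The point of the sketch is only that the store/load ordering is
expressed on the shared variables rather than by a free-standing fence
placed where a DMA barrier used to be.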