On Tue, May 25, 2021 at 03:11:12PM -0500, Bob Pearson wrote: > The rxe driver has recently begun exhibiting failures in the python > tests that are due to stale values read from the completion queue > producer or consumer indices. Unlike the other loads of these shared > indices those in queue_count() were not protected by smp_load_acquire(). > > This patch replaces loads by smp_load_acquire() in queue_count(). > The observed errors no longer occur. > > Reported-by: Zhu Yanjun <zyj2020@xxxxxxxxx> > Fixes: d21a1240f516 ("RDMA/rxe: Use acquire/release for memory ordering") > Signed-off-by: Bob Pearson <rpearsonhpe@xxxxxxxxx> > drivers/infiniband/sw/rxe/rxe_queue.h | 18 ++++++++++++++++-- > 1 file changed, 16 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/sw/rxe/rxe_queue.h b/drivers/infiniband/sw/rxe/rxe_queue.h > index 2902ca7b288c..5cb142282fa6 100644 > +++ b/drivers/infiniband/sw/rxe/rxe_queue.h > @@ -161,8 +161,22 @@ static inline unsigned int index_from_addr(const struct rxe_queue *q, > > static inline unsigned int queue_count(const struct rxe_queue *q) > { > - return (q->buf->producer_index - q->buf->consumer_index) > - & q->index_mask; > + u32 prod; > + u32 cons; > + u32 count; > + > + /* make sure all changes to queue complete before > + * changing producer index > + */ > + prod = smp_load_acquire(&q->buf->producer_index); > + > + /* make sure all changes to queue complete before > + * changing consumer index > + */ > + cons = smp_load_acquire(&q->buf->consumer_index); > + count = (prod - cons) % q->index_mask; > + > + return count; > } It doesn't quite make sense to load both? Only the one written by userspace should require a load_acquire, the one written by the kernel should already be locked somehow with a kernel lock or there is a different problem But yes, this does look like a bug to me Jason