On Thu, Jun 15, 2023 at 11:01:38AM -0500, Bob Pearson wrote: > I am still on a campaign to tighten the screws in the rxe driver. There are a lot of variables that are shared > between the requester task and the completer task (now on work queues) that control resources and error recovery. > There is almost no effort to make sure changes in one thread are visible in the other. The following is a summary: > > In requester task In completer task > qp->req.psn RW R > qp->req.rd_atomic (A) RW W > qp->req.wait_fence W RW > qp->req.need_rd_atomic W RW > qp->req.wait_psn W RW > qp->req.need_retry RW RW > qp->req.wait_for_rnr_timer RW W > > These are all int's except for rd_atomic which is an atomic_t and all properly aligned. > Several of these are similar to wait_fence: > > if (rxe_wqe_is_fenced(qp, wqe) { > qp->req.wait_fence = 1; > goto exit; (the task thread) > } > ... > // completed something > if (qp->req.wait_fence) { > qp->req.wait_fence = 0; > rxe_sched_task(&qp->req.task); > // requester will run at least once > // after this > } > > As long as the write and read actually get executed this will work eventually because the caches are > coherent. But what if they don't? The sched_task implies a memory barrier before the requester task > runs again but it doesn't read wait_fence so it doesn't seem to matter. > > There also may be a race between a second execution of the requester re-setting the flag and the completer > clearing it since someone else (e.g. verbs API could also schedule the requester.) I think the worst > that can happen here is an extra rescheduling which is safe. > > Could add an explicit memory barrier in the requester or matched smp_store_release/smp_load_acquire, > or a spinlock, or WRITE_ONCE/READ_ONCE. I am not sure what, if anything, should be done in this case. > It currently works fine AFAIK on x86/x64 but there are others. It looks really sketchy. This is the requestor hitting a fence opcode and needing to pause processing until the completor reaches the matching barrier? How is this just not completely racy? Forget about caches and barriers. Jason