On Fri, Jan 12, 2018 at 06:32:47AM +0000, Bart Van Assche wrote: > On Fri, 2018-01-12 at 08:23 +0200, Leon Romanovsky wrote: > > On Thu, Jan 11, 2018 at 07:07:06PM +0000, Bart Van Assche wrote: > > > On Thu, 2018-01-11 at 21:00 +0200, Leon Romanovsky wrote: > > > > On Thu, Jan 11, 2018 at 04:02:33PM +0000, Bart Van Assche wrote: > > > > > On Thu, 2018-01-11 at 08:22 +0200, Leon Romanovsky wrote: > > > > > > The proposed patch definitely decreases the chance of races, but it is not fixing them. > > > > > > There is a chance to have change in qp state immediately after your "if ..." check. > > > > > > > > > > Hello Leon, > > > > > > > > > > Please have a look at rxe_qp_error() and you will see that the patch I posted > > > > > is a proper fix. In the scenario you described rxe_qp_error() will trigger a > > > > > run of rxe_completer(). > > > > > > > > Bart, > > > > > > > > What am I missing? > > > > > > > > CPU1 CPU2 > > > > if (unlikely.... > > > > <--- > > > > /* move the qp to the error state */ > > > > void rxe_qp_error(struct rxe_qp *qp) > > > > { > > > > qp->req.state = QP_STATE_ERROR; > > > > qp->resp.state = QP_STATE_ERROR; > > > > qp->attr.qp_state = IB_QPS_ERR; > > > > ---> > > > > rxe_run_task(&qp->req.task, must_sched); > > > > > > > > > > > > > > > > It is more or less the same as without "if (unlikely..." > > > > > > Hello Leon, > > > > > > In the above the part of rxe_qp_error() that I was referring to in my e-mail > > > is missing: > > > > > > if (qp_type(qp) == IB_QPT_RC) > > > rxe_run_task(&qp->comp.task, 1); > > > > > > But it is exactly where race exists, as long QP isn't protected, it can > > switch CPUs and create race. > > Hello Leon, > > Can you clarify which race you are referring to? rxe_run_task() uses the > tasklet mechanism and tasklets are guaranteed to run on at most one CPU at a > time. See also the "Top and Bottom Halves" chapter in Linux Device Drivers, > 3rd edition. See also the tasklet_schedule() implementation in > <linux/interrupt.h> and in kernel/softirq.c. Ahh, Bart, Sorry. I found the cause of my misunderstanding, it is "comp" in the rxe_run_task call and not "req". Thanks > > Thanks, > > Bart.
Attachment:
signature.asc
Description: PGP signature