RE: [PATCH rdma-next 1/1] RDMA/mana_ib: process QP error events

Konstantin Taranov <kotaranov@xxxxxxxxxxxxx> · Thu, 6 Jun 2024 08:49:33 +0000

> > +
> > +	switch (event->type) {
> > +	case GDMA_EQE_RNIC_QP_FATAL:
> > +		qpn = event->details[0];
> > +		xa_lock_irqsave(&mdev->qp_table_rq, flag);
> > +		qp = xa_load(&mdev->qp_table_rq, qpn);
> > +		if (qp)
> > +			refcount_inc(&qp->refcount);
> 
> 
> Move this to after checking for "if (!qp) break".

Then we can have a race condition.
Imagine that EQE came when we tried to destroy QP and the got the following execution order:
           EQE             |        QP destroy
1    xa_lock            |
2    qp = xa_load   |
3    xa_unlock       |
4                              |         xa_erase_irq
5                             |          refcount_dec
6                             |          complete
7  refcount_inc     |                                                   <-------- BUG

> 
> > +		xa_unlock_irqrestore(&mdev->qp_table_rq, flag);
> > +		if (!qp)
> > +			break;
> > +		if (qp->ibqp.event_handler) {
> > +			ev.device = qp->ibqp.device;
> > +			ev.element.qp = &qp->ibqp;
> > +			ev.event = IB_EVENT_QP_FATAL;
> > +			qp->ibqp.event_handler(&ev, qp->ibqp.qp_context);
> > +		}
> > +		if (refcount_dec_and_test(&qp->refcount))
> > +			complete(&qp->free);
> > +		break;

> Strange logic. Why not do:
> if (!refcount_dec_and_test(&qp->refcount))
> 	wait_for_completion(&qp->free);
> 

It might work, but the logic will be even stranger and it will prevent some debugging.
With the proposed change, qp->free may not be completed even though the counter is 0.
As a result, the change makes an incorrect state to be an expected state, thereby making bugs with that side effect undetectable.
E.g., we have a bug "use after free" and then we try to trace whether qp was in use.
Plus, it is a good practice deinit everything that was inited. With the proposed change it is violated.