Re: QP reset question

On 3/31/2021 10:18 PM, Bob Pearson wrote:
On 3/31/21 1:23 PM, Tom Talpey wrote:
On 3/30/2021 5:01 PM, Bob Pearson wrote:
Jason,

Somewhere in Dotan's blog I saw him say that if you put a QP in the reset state, it
- clears the SQ and RQ (if not SRQ) *AND*
- also clears the completion queues

I don't think that second bullet is correct. As you point out, the
CQ may hold other entries, not from this QP.

The volume 1 spec does say this around QP destroy in section 10.2.4.4:

It is good programming practice to modify the QP into the Error
state and retrieve the relevant CQEs prior to destroying the QP.
Destroying a QP does not guarantee that CQEs of that QP are
deallocated from the CQ upon destruction. Even if the CQEs are
already on the CQ, it might not be possible to retrieve them. It is
good programming practice not to make any assumption on the number
of CQEs in the CQ when destroying a QP. In order to avoid CQ
overflow, it is recommended that all CQEs of the destroyed QP are
retrieved from the CQ associated with it before resizing the CQ,
attaching a new QP to the CQ or reopening the QP, if the CQ
capacity is limited.

There's additional supporting text in 10.3.1 around this. The
QP is always transitioned to Error, then CQEs drained, then QP
to Reset.

In https://www.rdmamojo.com/2012/05/05/qp-state-machine/ it says
Reset state
Description

A QP is being created in the Reset state. In this state, all the needed resources of the QP are already allocated.

In order to reuse a QP, it can be transitioned to Reset state from any state by calling to ibv_modify_qp(). If prior to this state transition, there were any Work Requests or completions in the send or receive queues of that QP, they will be cleared from the queues.

It's too bad he's not here to discuss, but I assert the text is wrong.
Completions are never present on work queues, so it's a contradiction.

I believe that some implementations, including the Mellanox one when
this text was written, basically "promote" a WR to become a CQE as
it moves from work to completion. But that is an implementation
choice. The spec is explicit in separating them, as it should be.

Also, as we've pointed out, completion queues are not 1:1 with
send and receive queues. They are commonly shared. Erasing
entries from them is disastrous to the other QPs.
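
To make the shared-CQ point concrete, here is a minimal toy model,
not real verbs or driver code. The names toy_cq, toy_cqe,
toy_cq_wipe and toy_cq_remove_qp are hypothetical and invented for
illustration only. It contrasts wiping a whole CQ with removing
just one QP's entries, and makes no claim about what any actual
driver does.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_DEPTH 16

/* Hypothetical toy types; these are NOT the libibverbs or kernel structures. */
struct toy_cqe {
	uint32_t qp_num;	/* QP that generated this completion */
	uint64_t wr_id;		/* caller-chosen work request id */
};

struct toy_cq {
	struct toy_cqe entries[TOY_CQ_DEPTH];
	size_t count;		/* number of valid entries, oldest first */
};

/* Append a completion; returns -1 if the toy CQ is full. */
static int toy_cq_push(struct toy_cq *cq, uint32_t qp_num, uint64_t wr_id)
{
	if (cq->count == TOY_CQ_DEPTH)
		return -1;
	cq->entries[cq->count].qp_num = qp_num;
	cq->entries[cq->count].wr_id = wr_id;
	cq->count++;
	return 0;
}

/*
 * "Clear the completion queue" taken literally: every entry is dropped.
 * Returns how many of the dropped entries belonged to OTHER QPs.
 */
static size_t toy_cq_wipe(struct toy_cq *cq, uint32_t qp_num)
{
	size_t lost = 0;

	for (size_t i = 0; i < cq->count; i++)
		if (cq->entries[i].qp_num != qp_num)
			lost++;
	cq->count = 0;
	return lost;
}

/*
 * Drop only the entries that belong to qp_num, keeping the rest in
 * their original order. Returns the number of entries removed.
 */
static size_t toy_cq_remove_qp(struct toy_cq *cq, uint32_t qp_num)
{
	size_t kept = 0, removed = 0;

	for (size_t i = 0; i < cq->count; i++) {
		if (cq->entries[i].qp_num == qp_num)
			removed++;
		else
			cq->entries[kept++] = cq->entries[i];
	}
	cq->count = kept;
	return removed;
}
```

With two completions from QP 1 and one from QP 2 queued on the same
toy CQ, toy_cq_wipe(&cq, 1) reports one completion lost that QP 2
was still owed, while toy_cq_remove_qp(&cq, 1) removes two entries
and leaves QP 2's completion in place.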

Finally, the state diagram in section 10.3.1 disagrees with the
assertion that a QP can transition to RESET from "any state".
The diagram is explicit in allowing only ERROR->RESET.

Not that he is the final arbiter, but it turns out that CX NICs pass these test cases, AFAIK. So I suspect that
someone is clearing out the CQs somehow. In fact, I just found this in mlx5 qp.c:

if (new_state == IB_QPS_RESET && .... ) {
	mlx5_ib_cq_clean(recv_cq, ....)
	mlx5_ib_cq_clean(send_cq, ....)
}

which seems to be the culprit.

Not the culprit, the instigator! This is a bug.

So in order to be compatible with CX NICs it looks like I need to do the same thing for rxe.

I think the IB spec should be the reference, and it doesn't support
such a choice.



