On 8/31/22 2:45 AM, Tom Talpey wrote:
> On 8/29/2022 12:01 AM, Cheng Xu wrote:
>>
>> On 8/26/22 9:57 PM, Jason Gunthorpe wrote:
>>> On Fri, Aug 26, 2022 at 09:11:25AM -0400, Tom Talpey wrote:
>>>
>>>> With your change, ERDMA will pre-emptively fail such a newly posted
>>>> request, and generate no new completion. The consumer is left in limbo
>>>> on the status of its prior requests. Providers must not override this.
>>>
>>> Yeah, I tend to agree with Tom.
>>>
>>> And I also want to point out that Linux RDMA verbs does not follow the
>>> SW specifications of either IBTA or the iWarp group. We have our own
>>> expectation for how these APIs work that our own ULPs rely on.
>>>
>>> So pedantically debating what a software spec we don't follow says is
>>> not relevant. The utility is to understand the intention and use cases
>>> and ensure we cover the same. Usually this means we follow the spec :)
>>>
>>
>> Yeah, I totally agree with this.
>>
>> Actually, I thought that ULPs do not concern about the details of how the
>> flushing and modify_qp being performed in the drivers. The drain flow is
>> handled by a single ib_drain_qp call for ULPs. While ib_drain_qp API allows
>> vendor-custom implementation, this is invisible to ULPs.
>>
>> For the ULPs which implement their own drain flow instead of using
>> ib_drain_qp (I think it is rare in kernel), they will fail in erdma.
>>
>> Anyway, since our implementation is disputed, we'd like to keep the same
>> behavior with other vendors. Maybe firmware updating w/o driver changes or
>> software flushing in driver will fix this.
>
> To be clear, my concern is about the ordering of CQE flushes with
> respect to the WR posting fails. Draining the CQs in whatever way
> you choose to optimize for your device is not the issue, although
> it seems odd to me that you need such a thing.
>
> The problem is that your patch started failing the new requests
> _before_ the drain could be used to clean up. This introduced
> two new provider behaviors that consumers would not expect:
>
> - first error detected in a post call (on the fast path!)
> - inability to determine if prior requests were complete
>

Yes, you are right. As I replied, we will drop this patch, and follow the
common behaviors as other providers do.

> I'd really suggest getting a copy of the full IB spec and examining
> the difference between QP "Error" and "SQ Error" states. They are
> subtle but important.

Yeah, I'm already doing this.

Thanks very much.
Cheng Xu

> Tom.
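
For anyone following the ordering argument in this thread, here is a rough sketch of why failing posts in the error state breaks a consumer-side drain. It is a toy Python model, not kernel or erdma code: ToyQP, drain_sq, PostError and the reject_posts_in_error flag are all hypothetical names made up for illustration, and the flush is modelled as synchronous, which real hardware does not guarantee. The drain pattern shown (move the QP to error, post a marker WR, wait for the marker's completion) is only an approximation of what a ULP or ib_drain_qp relies on.

```python
from collections import deque
from enum import Enum


class State(Enum):
    RTS = "rts"
    ERR = "err"


class PostError(Exception):
    """Raised when the toy provider refuses a post call."""


class ToyQP:
    """Toy queue pair; hypothetical, for illustration only."""

    def __init__(self, reject_posts_in_error=False):
        self.state = State.RTS
        # True models the disputed behavior: post fails once the QP is in error.
        self.reject_posts_in_error = reject_posts_in_error
        self.pending = deque()  # posted WRs that have no completion yet
        self.cq = deque()       # (wr_id, status) completions, in order

    def post_send(self, wr_id):
        if self.state is State.ERR and self.reject_posts_in_error:
            raise PostError(f"WR {wr_id} rejected: QP is in error state")
        self.pending.append(wr_id)
        if self.state is State.ERR:
            # Conventional behavior: the WR is accepted, then flushed.
            self._flush()

    def modify_to_error(self):
        self.state = State.ERR
        self._flush()

    def _flush(self):
        # Complete every outstanding WR with a flush status, in posting order.
        while self.pending:
            self.cq.append((self.pending.popleft(), "FLUSH_ERR"))

    def poll_cq(self):
        return self.cq.popleft() if self.cq else None


def drain_sq(qp, marker="drain-marker"):
    """Return the completions seen before the marker WR completes."""
    qp.modify_to_error()
    qp.post_send(marker)  # raises PostError if the provider rejects it
    seen = []
    while True:
        cqe = qp.poll_cq()
        if cqe is None:
            raise RuntimeError("marker never completed")
        if cqe[0] == marker:
            return seen  # everything posted before the marker is accounted for
        seen.append(cqe)
```

With the default behavior the marker's completion tells the consumer that all earlier requests have been reported. With reject_posts_in_error=True the marker cannot be posted at all, so the consumer gets an error on the post path and no in-band signal that the prior requests are done, which are the two surprises listed above.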