Re: [PATCH for-next 0/2] RDMA/erdma: Introduce custom implementation of drain_sq and drain_rq

Cheng Xu <chengyou@xxxxxxxxxxxxxxxxx> · Wed, 31 Aug 2022 10:52:14 +0800

On 8/31/22 2:45 AM, Tom Talpey wrote:
> On 8/29/2022 12:01 AM, Cheng Xu wrote:
>>
>>
>> On 8/26/22 9:57 PM, Jason Gunthorpe wrote:
>>> On Fri, Aug 26, 2022 at 09:11:25AM -0400, Tom Talpey wrote:
>>>
>>>> With your change, ERDMA will pre-emptively fail such a newly posted
>>>> request, and generate no new completion. The consumer is left in limbo
>>>> on the status of its prior requests. Providers must not override this.
>>>
>>> Yeah, I tend to agree with Tom.
>>>
>>> And I also want to point out that Linux RDMA verbs does not follow the
>>> SW specifications of either IBTA or the iWarp group. We have our own
>>> expectation for how these APIs work that our own ULPs rely on.
>>>
>>> So pedantically debating what a software spec we don't follow says is
>>> not relevant. The utility is to understand the intention and use cases
>>> and ensure we cover the same. Usually this means we follow the spec :)
>>>
>>
>> Yeah, I totally agree with this.
>>
>> Actually, I thought that ULPs are not concerned with the details of how
>> flushing and modify_qp are performed in the drivers. For ULPs, the drain
>> flow is handled by a single ib_drain_qp call. While the ib_drain_qp API
>> allows vendor-custom implementations, this is invisible to ULPs.
>>
>> ULPs that implement their own drain flow instead of using
>> ib_drain_qp (I think this is rare in the kernel) will fail on erdma.
>>
>> Anyway, since our implementation is disputed, we'd like to keep the same
>> behavior as other vendors. Maybe a firmware update w/o driver changes, or
>> software flushing in the driver, will fix this.
> 
> To be clear, my concern is about the ordering of CQE flushes with
> respect to the WR posting fails. Draining the CQs in whatever way
> you choose to optimize for your device is not the issue, although
> it seems odd to me that you need such a thing.
> 
> The problem is that your patch started failing the new requests
> _before_ the drain could be used to clean up. This introduced
> two new provider behaviors that consumers would not expect:
> 
> - first error detected in a post call (on the fast path!)
> - inability to determine if prior requests were complete
> 
Yes, you are right. As I replied, we will drop this patch and follow
the same behavior as other providers.

> I'd really suggest getting a copy of the full IB spec and examining
> the difference between QP "Error" and "SQ Error" states. They are
> subtle but important.

Yeah, I'm already doing this. Thanks very much.

Cheng Xu

> Tom.