On 10/17/23 13:51, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 01:44:58PM -0500, Bob Pearson wrote:
>> On 10/17/23 12:58, Jason Gunthorpe wrote:
>>> On Tue, Oct 17, 2023 at 12:09:31PM -0500, Bob Pearson wrote:
>>>
>>>> For qp#167 the call to srp_post_send() is followed by the rxe driver
>>>> processing the send operation and generating a work completion, which
>>>> is posted to the send cq, but there is never a following call to
>>>> __srp_get_rx_iu(), so the cqe is never received by srp and the
>>>> operation fails.
>>>
>>> ? I don't see this function in the kernel? __srp_get_tx_iu ?
>>>
>>>> I don't yet understand the logic of the srp driver well enough to fix
>>>> this, but the problem is not in the rxe driver as far as I can tell.
>>>
>>> It looks to me like __srp_get_tx_iu() is following the design pattern
>>> where the send queue is only polled when it needs to allocate a new
>>> send buffer - i.e. the send buffers are pre-allocated and cycle through
>>> the queue.
>>>
>>> So it is not surprising this isn't being called if it is hung - the
>>> hang is probably something that is preventing it from even wanting to
>>> send, which is probably a receive side issue.
>>>
>>> Following back up from that point to isolate what resource is missing
>>> to trigger the send may bring some more clarity.
>>>
>>> Alternatively, if __srp_get_tx_iu() is failing then perhaps you've run
>>> into an issue where it hit something rare and recovery does not work.
>>>
>>> E.g. this kind of design pattern carries a subtle assumption that the
>>> rx and send CQs are ordered together. Getting an rx CQE before a
>>> matching tx CQE can trigger the unusual scenario where the send side
>>> runs out of resources.
>>>
>>> Jason
>>
>> In all the traces I have looked at, the hang only occurs once the final
>> send side completions are not received. This happens when the srp
>> driver doesn't poll (i.e. call ib_process_cq_direct()). The rest is
>> my conjecture. Since there are several (e.g. qp#167 through qp#211,
>> odd-numbered) qp's with missing completions, there are 23 iu's tied up
>> when srp hangs. Your suggestion makes sense as to why the hang occurs.
>> When the test finishes, the qp's are destroyed and the driver calls
>> ib_process_cq_direct() again, which cleans up the resources.
>>
>> The problem is that there isn't any obvious way to find a thread related
>> to the missing cqe's to poll for them. I think the best way to fix this
>> is to convert the send side cq handling to interrupt driven (as is the
>> case with the srpt driver). The provider drivers have to run in any case
>> to convert cqe's to wc's, so there isn't much penalty in calling the cq
>> completion handler: software is already running, and then you get
>> reliable delivery of completions.
>
> Can you add tracing to show that SRP is running out of SQ resources,
> i.e. that __srp_get_tx_iu() fails and that this is a precondition for
> the hang?
>
> I am fully willing to believe that is not ever tested.
>
> Otherwise, if srp thinks it has SQ resources then the SQ is probably
> not the cause of the hang.
>
> Jason

Well... the extra tracing did *not* show srp running out of iu's. So I
converted cq handling to IB_POLL_SOFTIRQ from IB_POLL_DIRECT. This
required adding a spinlock around list_add(&iu->list, ...) in
srp_send_done(). The test now runs with all the completions handled
correctly. But it still hangs. So a red herring. The hunt continues.

Bob