Re: [bug report] blktests srp/002 hang

On 10/17/23 12:58, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 12:09:31PM -0500, Bob Pearson wrote:
> 
>  
>> For qp#167 the call to srp_post_send() is followed by the rxe driver
>> processing the send operation and generating a work completion which
>> is posted to the send cq but there is never a following call to
>> __srp_get_rx_iu() so the cqe is not received by srp, resulting in failure.
> 
> ? I don't see this function in the kernel?  __srp_get_tx_iu ?
>  
>> I don't yet understand the logic of the srp driver to fix this but
>> the problem is not in the rxe driver as far as I can tell.
> 
> It looks to me like __srp_get_tx_iu() is following the design pattern
> where the send queue is only polled when it needs to allocate a new
> send buffer - ie the send buffers are pre-allocated and cycle through
> the queue.
> 
> So, it is not surprising this isn't being called if it is hung - the
> hang is probably something that is preventing it from even wanting to
> send, which is probably a receive side issue.
> 
> Following back up from that point to isolate which missing
> resource would trigger a send may bring some more clarity.
> 
> Alternatively if __srp_get_tx_iu() is failing then perhaps you've run
> into an issue where it hit something rare and recovery does not work.
> 
> eg this kind of design pattern carries a subtle assumption that the rx
> and send CQ are ordered together. Getting a rx CQ before a matching tx
> CQ can trigger the unusual scenario where the send side runs out of
> resources.
> 
> Jason

In all the traces I have looked at, the hang only occurs once the final
send side completions are not received. This happens when the srp
driver doesn't poll (i.e. call ib_process_cq_direct). The rest is
my conjecture. Since there are several qp's with missing completions
(e.g. the odd-numbered qp's from qp#167 through qp#211), there are 23
iu's tied up when srp hangs. Your suggestion makes sense as to why the
hang occurs. When the test finishes, the qp's are destroyed and the
driver calls ib_process_cq_direct again, which cleans up the resources.
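
To make the lazy polling pattern concrete, here is a toy userspace model
of it. This is only a sketch and not the ib_srp code: toy_ch,
TOY_POOL_SIZE and all the toy_* functions are made up for illustration,
with toy_poll_send_cq() standing in for ib_process_cq_direct() and
toy_get_tx_iu() standing in for __srp_get_tx_iu(). The only thing it is
meant to show is that a send completion stays unreaped until something
asks for another tx iu.

```c
#include <assert.h>

#define TOY_POOL_SIZE 4  /* hypothetical pool size, not the real value */

/* Hypothetical per-channel state: a pre-allocated pool of tx iu's. */
struct toy_ch {
	int free_ius;      /* tx iu's currently on the free list */
	int pending_cqes;  /* send completions queued but not yet polled */
};

static void toy_init(struct toy_ch *ch)
{
	ch->free_ius = TOY_POOL_SIZE;
	ch->pending_cqes = 0;
}

/* Stand-in for ib_process_cq_direct(): reap every queued completion
 * and return its iu to the free list. Returns the number reaped. */
static int toy_poll_send_cq(struct toy_ch *ch)
{
	int reaped = ch->pending_cqes;

	ch->free_ius += reaped;
	ch->pending_cqes = 0;
	assert(ch->free_ius <= TOY_POOL_SIZE);
	return reaped;
}

/* Stand-in for __srp_get_tx_iu(): the send cq is polled only here,
 * as a side effect of wanting a new send buffer.
 * Returns 0 on success, -1 if no iu is free. */
static int toy_get_tx_iu(struct toy_ch *ch)
{
	toy_poll_send_cq(ch);
	if (ch->free_ius == 0)
		return -1;
	ch->free_ius--;
	return 0;
}

/* The provider finishes a send: a cqe is queued, nobody is notified. */
static void toy_send_completes(struct toy_ch *ch)
{
	ch->pending_cqes++;
}
```

In this model, after toy_get_tx_iu() followed by toy_send_completes(),
the iu stays tied up (pending_cqes == 1) until the next toy_get_tx_iu()
call. If nothing ever wants to send again, it is never reaped.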

The problem is that there isn't any obvious way to find a thread related
to the missing cqe's that could poll for them. I think the best way to fix
this is to convert the send side cq handling to be interrupt driven (as is
the case with the srpt driver). The provider drivers have to run in any
case to convert cqe's to wc's, so there isn't much penalty in calling the
cq completion handler since there is already software running, and then
you will get reliable delivery of completions.
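
As a rough sketch of the difference I mean, here is a toy userspace model
in which the provider calls a completion handler whenever it queues a
cqe. None of this is kernel code: evt_cq, evt_reap() and
evt_provider_post_cqe() are hypothetical names. In the kernel I would
assume this corresponds to allocating the send cq with a poll context
other than IB_POLL_DIRECT, but I have not worked out the details.

```c
#include <assert.h>
#include <stddef.h>

struct evt_cq;
typedef void (*evt_comp_handler)(struct evt_cq *cq);

/* Hypothetical send cq plus the iu accounting that hangs off it. */
struct evt_cq {
	int pending_cqes;          /* completions queued, not yet reaped */
	int free_ius;              /* tx iu's available to the sender */
	int pool_size;             /* total number of tx iu's */
	evt_comp_handler handler;  /* NULL models a caller-polled cq */
};

/* Completion handler: reap everything and free the matching iu's. */
static void evt_reap(struct evt_cq *cq)
{
	cq->free_ius += cq->pending_cqes;
	cq->pending_cqes = 0;
	assert(cq->free_ius <= cq->pool_size);
}

/* The provider queues a cqe. It is already running at this point, so
 * invoking the handler (if one is registered) costs little extra. */
static void evt_provider_post_cqe(struct evt_cq *cq)
{
	cq->pending_cqes++;
	if (cq->handler)
		cq->handler(cq);
}
```

With handler set to NULL the cqe just sits in the queue until someone
polls. With handler set to evt_reap the iu is returned as soon as the
provider posts the completion, regardless of whether the sender ever asks
for another buffer.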

Bob


