Re: [bug report] blktests srp/002 hang

Jason Gunthorpe <jgg@xxxxxxxx> · Tue, 17 Oct 2023 15:51:39 -0300

On Tue, Oct 17, 2023 at 01:44:58PM -0500, Bob Pearson wrote:
> On 10/17/23 12:58, Jason Gunthorpe wrote:
> > On Tue, Oct 17, 2023 at 12:09:31PM -0500, Bob Pearson wrote:
> > 
> >  
> >> For qp#167 the call to srp_post_send() is followed by the rxe driver
> >> processing the send operation and generating a work completion which
> >> is posted to the send cq, but there is never a following call to
> >> __srp_get_rx_iu(), so the cqe is never received by srp, leading to
> >> the failure.
> > 
> > ? I don't see this function in the kernel. Did you mean __srp_get_tx_iu()?
> >  
> >> I don't yet understand the logic of the srp driver to fix this but
> >> the problem is not in the rxe driver as far as I can tell.
> > 
> > It looks to me like __srp_get_tx_iu() is following the design pattern
> > where the send queue is only polled when it needs to allocate a new
> > send buffer - ie the send buffers are pre-allocated and cycle through
> > the queue.
> > 
> > So, it is not surprising this isn't being called if it is hung - the
> > hang is probably something that is preventing it from even wanting to
> > send, which is probably a receive side issue.
> > 
> > Following back up from that point to isolate which missing
> > resource is blocking the send may bring some more clarity.
> > 
> > Alternatively if __srp_get_tx_iu() is failing then perhaps you've run
> > into an issue where it hit something rare and recovery does not work.
> > 
> > eg this kind of design pattern carries a subtle assumption that the rx
> > and send CQ are ordered together. Getting an rx completion before the
> > matching tx completion can trigger the unusual scenario where the send
> > side runs out of resources.
> > 
> > Jason
> 
> In all the traces I have looked at the hang only occurs once the final
> send side completions are not received. This happens when the srp
> driver doesn't poll (i.e. call ib_process_cq_direct). The rest is
> my conjecture. Since several qp's have missing completions (e.g. the
> odd-numbered ones from qp#167 through qp#211), there are 23 iu's tied
> up when srp hangs.
> Your suggestion makes sense as to why the hang occurs. When the test
> finishes the qp's are destroyed and the driver calls ib_process_cq_direct
> again which cleans up the resources.
> 
> The problem is that there isn't any obvious way to find a thread related
> to the missing cqe to poll for them. I think the best way to fix this is
> to convert the send side cq handling to interrupt driven (as is the case
> with the srpt driver.) The provider drivers have to run in any case to
> convert cqe's to wc's so there isn't much penalty to call the cq
> completion handler since there is already software running and then you
> will get reliable delivery of completions.

Can you add tracing to show that SRP is running out of SQ resources,
ie __srp_get_tx_iu() fails and that is a precondition for the hang?

I am fully willing to believe that path is never tested.

Otherwise if srp thinks it has SQ resources then the SQ is probably
not the cause of the hang.
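
To make the distinction concrete, here is a rough user-space model of
the pattern as I understand it. It is not the ib_srp code: every name
in it (model_ch, model_get_tx_iu(), the get_tx_fail counter) and the
ring size are made up for illustration. It only shows that send
completions sitting unpolled are harmless until an allocation actually
finds the free list empty, which is the event worth tracing.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_TX_RING 4	/* hypothetical ring size, not ib_srp's value */

struct model_iu {
	struct model_iu *next;
};

struct model_ch {
	struct model_iu ring[MODEL_TX_RING];
	struct model_iu *free_tx;		/* IUs available for a new send */
	struct model_iu *done[MODEL_TX_RING];	/* send completions not yet polled */
	int ndone;
	unsigned long get_tx_fail;		/* allocations that found no free IU */
};

static void model_init(struct model_ch *ch)
{
	ch->free_tx = NULL;
	ch->ndone = 0;
	ch->get_tx_fail = 0;
	for (int i = 0; i < MODEL_TX_RING; i++) {
		ch->ring[i].next = ch->free_tx;
		ch->free_tx = &ch->ring[i];
	}
}

/* Stand-in for polling the send CQ: recycle every completed IU. */
static int model_poll_send_cq(struct model_ch *ch)
{
	int n = ch->ndone;

	while (ch->ndone > 0) {
		struct model_iu *iu = ch->done[--ch->ndone];

		iu->next = ch->free_tx;
		ch->free_tx = iu;
	}
	return n;
}

/* The send CQ is polled only here, when a new send buffer is wanted. */
static struct model_iu *model_get_tx_iu(struct model_ch *ch)
{
	struct model_iu *iu;

	model_poll_send_cq(ch);
	iu = ch->free_tx;
	if (!iu) {
		ch->get_tx_fail++;	/* the precondition to look for */
		return NULL;
	}
	ch->free_tx = iu->next;
	return iu;
}

/* Stand-in for the provider completing a posted send. */
static void model_complete_send(struct model_ch *ch, struct model_iu *iu)
{
	assert(iu != NULL && ch->ndone < MODEL_TX_RING);
	ch->done[ch->ndone++] = iu;
}

/* Number of IUs currently on the free list. */
static int model_free_count(const struct model_ch *ch)
{
	int n = 0;

	for (const struct model_iu *iu = ch->free_tx; iu; iu = iu->next)
		n++;
	return n;
}
```

In this model, completed sends that nobody polls leave the free count
unchanged, but get_tx_fail stays at zero until the ring really runs
dry. Seeing the equivalent of that counter go non-zero before the hang
is the evidence I am asking for.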

Jason