On 10/17/23 13:51, Jason Gunthorpe wrote:
> On Tue, Oct 17, 2023 at 01:44:58PM -0500, Bob Pearson wrote:
>> On 10/17/23 12:58, Jason Gunthorpe wrote:
>>> On Tue, Oct 17, 2023 at 12:09:31PM -0500, Bob Pearson wrote:
>>>
>>>> For qp#167 the call to srp_post_send() is followed by the rxe driver
>>>> processing the send operation and generating a work completion, which
>>>> is posted to the send cq, but there is never a following call to
>>>> __srp_get_rx_iu(), so the cqe is never received by srp and the
>>>> operation fails.
>>>
>>> ? I don't see this function in the kernel? __srp_get_tx_iu ?
>>>
>>>> I don't yet understand the logic of the srp driver well enough to fix
>>>> this, but the problem is not in the rxe driver as far as I can tell.
>>>
>>> It looks to me like __srp_get_tx_iu() is following the design pattern
>>> where the send queue is only polled when it needs to allocate a new
>>> send buffer - i.e. the send buffers are pre-allocated and cycle through
>>> the queue.
>>>
>>> So it is not surprising this isn't being called if it is hung - the
>>> hang is probably something that is preventing it from even wanting to
>>> send, which is probably a receive side issue.
>>>
>>> Following back up from that point to isolate what resource is missing
>>> to trigger the send may bring some more clarity.
>>>
>>> Alternatively, if __srp_get_tx_iu() is failing then perhaps you've run
>>> into an issue where it hit something rare and recovery does not work.
>>>
>>> E.g. this kind of design pattern carries a subtle assumption that the
>>> rx and send CQs are ordered together. Getting an rx CQE before a
>>> matching tx CQE can trigger the unusual scenario where the send side
>>> runs out of resources.
>>>
>>> Jason
>>
>> In all the traces I have looked at, the hang only occurs once the final
>> send side completions are not received. This happens when the srp
>> driver doesn't poll (i.e. call ib_process_cq_direct()). The rest is
>> my conjecture. Since there are several (e.g. qp#167 through qp#211,
>> odd-numbered) qp's with missing completions, there are 23 iu's tied up
>> when srp hangs. Your suggestion makes sense as to why the hang occurs.
>> When the test finishes, the qp's are destroyed and the driver calls
>> ib_process_cq_direct() again, which cleans up the resources.
>>
>> The problem is that there isn't any obvious way to find a thread related
>> to the missing cqe's to poll for them. I think the best way to fix this
>> is to convert the send side cq handling to interrupt driven (as is the
>> case with the srpt driver). The provider drivers have to run in any case
>> to convert cqe's to wc's, so there isn't much penalty in calling the cq
>> completion handler: software is already running, and then you get
>> reliable delivery of completions.
>
> Can you add tracing to show that SRP is running out of SQ resources,
> i.e. that __srp_get_tx_iu() fails and that this is a precondition for
> the hang?
>
> I am fully willing to believe that is not ever tested.
>
> Otherwise, if srp thinks it has SQ resources then the SQ is probably
> not the cause of the hang.
>
> Jason

Well... the extra tracing did *not* show srp running out of iu's. So I
converted cq handling to IB_POLL_SOFTIRQ from IB_POLL_DIRECT. This
required adding a spinlock around list_add(&iu->list, ...) in
srp_send_done(). The test now runs with all the completions handled
correctly. But it still hangs. So a red herring. The hunt continues.

Bob