On Tue, Oct 17, 2023 at 01:44:58PM -0500, Bob Pearson wrote:
> On 10/17/23 12:58, Jason Gunthorpe wrote:
> > On Tue, Oct 17, 2023 at 12:09:31PM -0500, Bob Pearson wrote:
> >
> >
> >> For qp#167 the call to srp_post_send() is followed by the rxe driver
> >> processing the send operation and generating a work completion which
> >> is posted to the send cq but there is never a following call to
> >> __srp_get_rx_iu() so the cqe is not received by srp and failure.
> >
> > ? I don't see this function in the kernel? __srp_get_tx_iu ?
> >
> >> I don't yet understand the logic of the srp driver to fix this but
> >> the problem is not in the rxe driver as far as I can tell.
> >
> > It looks to me like __srp_get_tx_iu() is following the design pattern
> > where the send queue is only polled when it needs to allocate a new
> > send buffer - ie the send buffers are pre-allocated and cycle through
> > the queue.
> >
> > So, it is not surprising this isn't being called if it is hung - the
> > hang is probably something that is preventing it from even wanting to
> > send, which is probably a receive side issue.
> >
> > Following back up from that point to isolate what is the missing
> > resource to trigger send may bring some more clarity.
> >
> > Alternatively if __srp_get_tx_iu() is failing then perhaps you've run
> > into an issue where it hit something rare and recovery does not work.
> >
> > eg this kind of design pattern carries a subtle assumption that the rx
> > and send CQ are ordered together. Getting a rx CQ before a matching tx
> > CQ can trigger the unusual scenario where the send side runs out of
> > resources.
> >
> > Jason
>
> In all the traces I have looked at the hang only occurs once the final
> send side completions are not received. This happens when the srp
> driver doesn't poll (i.e. call ib_process_cq_direct). The rest is
> my conjecture. Since there are several (e.g. qp#167 through qp#211 (odd))
> qp's with missing completions there are 23 iu's tied up when srp hangs.
> Your suggestion makes sense as to why the hang occurs. When the test
> finishes the qp's are destroyed and the driver calls ib_process_cq_direct
> again which cleans up the resources.
>
> The problem is that there isn't any obvious way to find a thread related
> to the missing cqe to poll for them. I think the best way to fix this is
> to convert the send side cq handling to interrupt driven (as is the case
> with the srpt driver.) The provider drivers have to run in any case to
> convert cqe's to wc's so there isn't much penalty to call the cq
> completion handler since there is already software running and then you
> will get reliable delivery of completions.

Can you add tracing to show that SRP is running out of SQ resources,
ie __srp_get_tx_iu() fails and that is a precondition for the hang?

I am fully willing to believe that is not ever tested.

Otherwise if srp thinks it has SQ resources then the SQ is probably
not the cause of the hang.

Jason
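
For illustration, the design pattern discussed above (pre-allocated send
buffers, with the send CQ polled only when a new send buffer is needed) can
be modeled in a few lines of userspace C. This is a toy sketch under stated
assumptions, not the ib_srp code: `struct model_ch`, `NUM_TX_IU` and the
`model_*` functions are hypothetical names invented for the example, and
counters stand in for the real free list and completion queue.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_TX_IU 4 /* hypothetical pool size, chosen for the example */

/* Toy stand-in for a channel: counters instead of real lists and CQs. */
struct model_ch {
	int free_tx;     /* send buffers available for allocation */
	int pending_cqe; /* send completions sitting unpolled in the CQ */
	int polls;       /* number of times the send CQ was polled */
};

/* Drain the send CQ: each completion returns one buffer to the pool. */
static void model_poll_send_cq(struct model_ch *ch)
{
	ch->polls++;
	ch->free_tx += ch->pending_cqe;
	ch->pending_cqe = 0;
	/* Sanity check: never more free buffers than were pre-allocated. */
	assert(ch->free_tx <= NUM_TX_IU);
}

/*
 * Allocate a send buffer. This is the ONLY place the send CQ is polled,
 * so completions are reaped only when the caller wants to send again.
 */
static bool model_get_tx_iu(struct model_ch *ch)
{
	model_poll_send_cq(ch);
	if (ch->free_tx == 0)
		return false; /* out of send-side resources */
	ch->free_tx--;
	return true;
}

/* The provider completes a send and queues a CQE; nobody is notified. */
static void model_send_done(struct model_ch *ch)
{
	ch->pending_cqe++;
}
```

In this model, completions queued by `model_send_done()` stay in
`pending_cqe` until the next `model_get_tx_iu()` call. If nothing wants to
send, they are never reaped, which matches the observation that the missing
completions are a symptom of the hang and not necessarily its cause.
`model_get_tx_iu()` returning false is the out-of-resources condition the
requested tracing would need to show.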