> On Mar 17, 2021, at 2:39 PM, Tom Talpey <tom@xxxxxxxxxx> wrote:
>
> On 3/17/2021 11:14 AM, Chuck Lever III wrote:
>>> On Mar 16, 2021, at 3:58 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>
>>> Hi-
>>>
>>> I've been trying to track down some crashes when running NFS/RDMA
>>> tests over FastLinQ devices in iWARP mode. To make it stressful,
>>> I've enabled disconnect injection, where rpcrdma injects a
>>> connection disconnect every so often.
>>>
>>> As part of a disconnect event, the Receive and Send queues are
>>> drained. Sometimes I see a duplicate flush for one or more of
>>> the memory registration ops. This is not a big deal for FastReg
>>> because its completion handler is basically a no-op.
>>>
>>> But for LocalInv this is a problem. On a flushed completion, the
>>> MR is destroyed. If the completion occurs again, of course, all
>>> kinds of badness happens because we're DMA-unmapping twice,
>>> touching memory that has already been freed, and deleting from a
>>> list_head that is poisonous.
>>>
>>> The last straw is that wc_localinv_done calls the generic RPC layer
>>> to indicate that an RPC Reply is ready. The duplicate flush
>>> dereferences one or more NULL pointers.
>> So this looked to me like a Queue wrap. After sleeping on it, I
>> decided to try disabling xprtrdma's Send signal batching. Setting
>> ep_send_batch to zero causes every Send WR to be signaled, and
>> that makes the problem go away.
>> This is a little surprising. Every LocalInv chain is signaled. The
>> only possible accounting error might be that ep_send_count does
>> not count FastReg WRs, which are always unsignaled.
>
> Well, perhaps you're posting several WRs, and the connection is being
> dropped before you post them all. Therefore, you bail out with the
> last one you did post being unsignaled. You had better hope that last
> one is flushed, because if it completed successfully, you may have a
> missing interrupt.
>
> It's really tricky to get unsignaled right, when errors occur. It
> might still be the provider, but there are possibilities on both
> sides of the API.

My current theory is that duplicate completions occur only for WRs
that were posted after a disconnect. This happens in the window where
the workload is still active and the connection has been lost, but
the DISCONNECTED CM event has not yet arrived.

My expectation was that such a WR would flush through and complete
once. What I'm seeing is that on occasion one or more WRs that were
posted in this window complete twice.

If I add some logic to block posting in that window, the duplicate
completion problem seems to go away. The test runs long enough
without a duplicate completion that I hit other bugs.

I never see duplicate Receive or Send completions.

When a duplicate completion occurs with LocalInv, I typically see
duplicate completions for all WRs on the same chained post. That
might also be the case for FastReg; I haven't looked closely. But the
Send WR these are chained to never sees a duplicate completion (could
be my duplicate checking logic for Sends doesn't work?).

This is with a QLogic Corp. FastLinQ QL41212HLCU 25GbE Adapter and
Storm FW 8.42.2.0, Management FW 8.30.18.0 [MBI 8.30.29].

--
Chuck Lever
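
For illustration only, here is a minimal user-space sketch of the two
ideas discussed above: refusing new posts once connection loss has
been noticed, and detecting a second completion for the same WR so
that teardown runs at most once. It is not the xprtrdma code and does
not use the verbs API. Every name in it (sketch_ep, sketch_wr,
sketch_post, sketch_complete, and the status enum) is hypothetical,
and treating the first flushed completion as the signal that the
connection is gone is an assumption of the sketch, not something the
thread establishes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from xprtrdma
 * or from the RDMA verbs API. */
enum sketch_wc_status { SKETCH_WC_SUCCESS, SKETCH_WC_FLUSHED };

struct sketch_ep {
	bool accepting_posts;         /* cleared once loss is noticed */
	unsigned int posted;          /* WRs accepted by sketch_post() */
	unsigned int dup_completions; /* repeat completions observed */
};

struct sketch_wr {
	bool completed;               /* set by the first completion */
};

void sketch_ep_init(struct sketch_ep *ep)
{
	assert(ep != NULL);
	ep->accepting_posts = true;
	ep->posted = 0;
	ep->dup_completions = 0;
}

/* Accept a WR only while the endpoint still accepts posts.
 * Returns 0 on success or -ENOTCONN once the gate is closed. */
int sketch_post(struct sketch_ep *ep, struct sketch_wr *wr)
{
	assert(ep != NULL && wr != NULL);
	if (!ep->accepting_posts)
		return -ENOTCONN;
	wr->completed = false;
	ep->posted++;
	return 0;
}

/* Handle one completion. Returns true if the caller should run its
 * teardown for this WR (first completion), or false if this is a
 * repeat and teardown must be skipped. */
bool sketch_complete(struct sketch_ep *ep, struct sketch_wr *wr,
		     enum sketch_wc_status status)
{
	assert(ep != NULL && wr != NULL);

	/* Assumption: any non-success status means the connection is
	 * gone, so close the posting gate immediately. */
	if (status != SKETCH_WC_SUCCESS)
		ep->accepting_posts = false;

	if (wr->completed) {
		ep->dup_completions++;
		return false;
	}
	wr->completed = true;
	return true;
}
```

In this sketch, a caller would run the MR teardown only when
sketch_complete() returns true, and would treat -ENOTCONN from
sketch_post() as a cue to wait for reconnect. The duplicate counter
only makes the symptom visible; it says nothing about whether the
provider or the consumer is responsible.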