> On Mar 16, 2021, at 3:58 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
> Hi-
>
> I've been trying to track down some crashes when running NFS/RDMA
> tests over FastLinQ devices in iWARP mode. To make it stressful,
> I've enabled disconnect injection, where rpcrdma injects a
> connection disconnect every so often.
>
> As part of a disconnect event, the Receive and Send queues are
> drained. Sometimes I see a duplicate flush for one or more of
> the memory registration ops. This is not a big deal for FastReg
> because its completion handler is basically a no-op.
>
> But for LocalInv this is a problem. On a flushed completion, the
> MR is destroyed. If the completion occurs again, of course, all
> kinds of badness happens because we're DMA-unmapping twice,
> touching memory that has already been freed, and deleting from a
> list_head that is poisonous.
>
> The last straw is that wc_localinv_done calls the generic RPC layer
> to indicate that an RPC Reply is ready. The duplicate flush
> dereferences one or more NULL pointers.

So this looked to me like a Queue wrap. After sleeping on it, I
decided to try disabling xprtrdma's Send signal batching. Setting
ep_send_batch to zero causes every Send WR to be signaled, and
that makes the problem go away.

This is a little surprising. Every LocalInv chain is signaled.
The only possible accounting error might be that ep_send_count
does not count FastReg WRs, which are always unsignaled.

More investigation needed.

> Doesn't the verbs API contract stipulate that every posted WR gets
> exactly one completion? I don't see this behavior with other
> providers.
>
> Thanks for any advice.

--
Chuck Lever
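
For illustration only, here is a toy userspace model of the accounting
concern described above. It is not the xprtrdma source: the struct and
the toy_* helpers are hypothetical, and the only things taken from the
message are the two assumptions that a batch value of zero signals every
Send and that FastReg WRs are posted unsignaled without being counted.

```c
#include <stdbool.h>

/*
 * Toy model, not kernel code. Field names loosely follow the
 * ep_send_batch / ep_send_count names used in the message above.
 */
struct toy_ep {
	unsigned int send_batch;        /* unsignaled Sends allowed between signaled ones */
	unsigned int send_count;        /* unsignaled Sends left before the next signal */
	unsigned int sq_unsignaled;     /* SQ entries posted since the last signaled Send */
	unsigned int sq_unsignaled_max; /* high-water mark of sq_unsignaled */
};

static void toy_ep_init(struct toy_ep *ep, unsigned int batch)
{
	ep->send_batch = batch;
	ep->send_count = 0;
	ep->sq_unsignaled = 0;
	ep->sq_unsignaled_max = 0;
}

/* Record one more unsignaled SQ entry and update the high-water mark. */
static void toy_note_unsignaled(struct toy_ep *ep)
{
	ep->sq_unsignaled++;
	if (ep->sq_unsignaled > ep->sq_unsignaled_max)
		ep->sq_unsignaled_max = ep->sq_unsignaled;
}

/*
 * FastReg is assumed to be always unsignaled and, per the hypothesis
 * above, it does not touch send_count at all.
 */
static void toy_post_fastreg(struct toy_ep *ep)
{
	toy_note_unsignaled(ep);
}

/* Returns true if this Send would be posted signaled. */
static bool toy_post_send(struct toy_ep *ep)
{
	if (ep->send_count == 0) {
		/* Signal this Send and start a new batch. */
		ep->send_count = ep->send_batch;
		ep->sq_unsignaled = 0;
		return true;
	}
	ep->send_count--;
	toy_note_unsignaled(ep);
	return false;
}
```

In this model, with a batch of 3 and Sends only, at most 3 unsignaled
entries accumulate between signaled Sends. If one FastReg precedes each
Send, that figure grows to 7, even though the Send counter behaves
exactly the same. With a batch of 0, every Send is signaled and at most
one unsignaled entry (the FastReg) is ever outstanding.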