Re: FastLinQ: possible duplicate flush of FastReg and LocalInv

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 17 Mar 2021 15:14:51 +0000

> On Mar 16, 2021, at 3:58 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
> Hi-
> 
> I've been trying to track down some crashes when running NFS/RDMA
> tests over FastLinQ devices in iWARP mode. To make it stressful,
> I've enabled disconnect injection, where rpcrdma injects a
> connection disconnect every so often.
> 
> As part of a disconnect event, the Receive and Send queues are
> drained. Sometimes I see a duplicate flush for one or more of the
> memory registration ops. This is not a big deal for FastReg
> because its completion handler is basically a no-op.
> 
> But for LocalInv this is a problem. On a flushed completion, the
> MR is destroyed. If the completion occurs again, of course, all
> kinds of badness happen because we're DMA-unmapping twice,
> touching memory that has already been freed, and deleting from a
> list_head that is poisonous.
> 
> The last straw is that wc_localinv_done calls the generic RPC layer
> to indicate that an RPC Reply is ready. The duplicate flush
> dereferences one or more NULL pointers.

So this looked to me like a queue wrap. After sleeping on it, I
decided to try disabling xprtrdma's Send signal batching. Setting
ep_send_batch to zero causes every Send WR to be signaled, and
that makes the problem go away.

This is a little surprising. Every LocalInv chain is signaled. The
only accounting error I can think of is that ep_send_count does
not count FastReg WRs, which are always unsignaled.
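
To make that suspicion concrete, here is a minimal user-space model of
counter-based Send signal batching. It is a sketch, not the xprtrdma
code: struct ep_model, send_is_signaled(), max_unsignaled_run(), and the
pattern of exactly one unsignaled FastReg ahead of each Send are all
assumptions made for illustration. It only shows how WRs that the
counter never sees can lengthen the run of unsignaled WRs between two
signaled Sends.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the endpoint's batching state. */
struct ep_model {
	unsigned int send_batch;	/* unsignaled Sends allowed per batch */
	unsigned int send_count;	/* unsignaled Sends left in this batch */
};

/* Decide whether the next Send WR is posted signaled. */
static bool send_is_signaled(struct ep_model *ep)
{
	if (ep->send_count == 0) {
		/* Batch exhausted: signal this Send and start a new batch. */
		ep->send_count = ep->send_batch;
		return true;
	}
	--ep->send_count;
	return false;
}

/*
 * Post nsends Sends, each preceded by one unsignaled FastReg
 * (assumed pattern). Return the largest number of unsignaled WRs
 * posted since the previous signaled Send.
 */
static unsigned int max_unsignaled_run(unsigned int batch,
				       unsigned int nsends)
{
	struct ep_model ep = { .send_batch = batch, .send_count = 0 };
	unsigned int run = 0, max = 0, i;

	for (i = 0; i < nsends; i++) {
		run++;			/* FastReg: unsignaled, not counted */
		if (send_is_signaled(&ep)) {
			if (run > max)
				max = run;
			run = 0;
		} else {
			run++;		/* unsignaled Send */
		}
	}
	return max;
}
```

Under these assumptions a batch of 3 produces runs of up to 7
unsignaled WRs rather than 3, while a batch of 0 signals every Send
and leaves at most the single FastReg unsignaled in between.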

More investigation needed.

> Doesn't the verbs API contract stipulate that every posted WR gets
> exactly one completion? I don't see this behavior with other
> providers.
> 
> Thanks for any advice.

--
Chuck Lever