On 3/17/2021 11:14 AM, Chuck Lever III wrote:
On Mar 16, 2021, at 3:58 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
Hi-
I've been trying to track down some crashes when running NFS/RDMA
tests over FastLinQ devices in iWARP mode. To make it stressful,
I've enabled disconnect injection, where rpcrdma injects a
connection disconnect every so often.
As part of a disconnect event, the Receive and Send queues are
drained. Sometimes I see a duplicate flush for one or more of
the memory registration ops. This is not a big deal for FastReg
because its completion handler is basically a no-op.
But for LocalInv this is a problem. On a flushed completion, the
MR is destroyed. If the completion occurs again, of course, all
kinds of badness happen because we're DMA-unmapping twice,
touching memory that has already been freed, and deleting from a
list_head that has been poisoned.
The last straw is that wc_localinv_done calls the generic RPC layer
to indicate that an RPC Reply is ready. The duplicate flush
dereferences one or more NULL pointers.
So this looked to me like a Queue wrap. After sleeping on it, I
decided to try disabling xprtrdma's Send signal batching. Setting
ep_send_batch to zero causes every Send WR to be signaled, and
that makes the problem go away.
This is a little surprising. Every LocalInv chain is signaled. The
only possible accounting error might be that ep_send_count does
not count FastReg WRs, which are always unsignaled.
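
To make the accounting question above concrete, here is a toy Python
model of Send signal batching. It is only a sketch: the class
SendSignalBatcher, the helper longest_unsignaled_run, and the parameter
fastreg_per_send are hypothetical names and are not xprtrdma's code.
The only behavior taken from the description above is that a batch of
zero signals every Send, and that FastReg WRs are unsignaled and not
counted.

```python
class SendSignalBatcher:
    """Toy model: signal one Send, then leave the next `batch` unsignaled."""

    def __init__(self, batch):
        self.batch = batch      # stand-in for ep_send_batch
        self.count = 0          # stand-in for ep_send_count

    def next_send_is_signaled(self):
        if self.count == 0:
            self.count = self.batch
            return True
        self.count -= 1
        return False


def longest_unsignaled_run(batch, fastreg_per_send, n_sends):
    """Longest run of consecutive unsignaled WRs posted to the Send Queue.

    Each Send is preceded by `fastreg_per_send` FastReg WRs, which are
    unsignaled and do not touch the batcher's counter (an assumption
    made for this model).
    """
    batcher = SendSignalBatcher(batch)
    longest = run = 0
    for _ in range(n_sends):
        run += fastreg_per_send          # uncounted, unsignaled FastReg WRs
        if batcher.next_send_is_signaled():
            longest = max(longest, run)
            run = 0
        else:
            run += 1                     # unsignaled Send
    return max(longest, run)
```

In this model, a batch of 2 with one FastReg per Send gives runs of
five unsignaled WRs, where the counter alone would suggest two. Whether
that gap matters on real hardware is not established here.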
Well, perhaps you're posting several WRs, and the connection is being
dropped before you post them all. Therefore, you bail out with the
last one you did post being unsignaled. You had better hope that last
one is flushed, because if it completed successfully, you may have a
missing interrupt.
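
The scenario above can be sketched as a small Python model. The
function names and the (name, signaled) tuple layout are hypothetical,
chosen only for illustration. The model assumes that a flushed WR
reports an error completion even when it was posted unsignaled, while a
WR that completes successfully reports one only if it was signaled.

```python
def post_until_disconnect(chain, posted_before_drop):
    """Return the WRs that reached the Send Queue before the drop.

    `chain` is a list of (name, signaled) tuples in posting order.
    """
    return chain[:posted_before_drop]


def completions(posted, flushed):
    """Names of the WRs that produce a completion in this model.

    Flushed WRs complete in error, and an error completion is reported
    even for an unsignaled WR. A successful WR reports a completion
    only if it was posted signaled.
    """
    return [name for name, signaled in posted if flushed or signaled]


# A chain whose only signaled WR is the last one.
chain = [("FastReg", False), ("Send", True)]

# The connection drops after the first WR is posted.
posted = post_until_disconnect(chain, 1)

flushed_case = completions(posted, flushed=True)     # ["FastReg"]
success_case = completions(posted, flushed=False)    # []
```

The empty list in the success case is the missing interrupt: the
poster bailed out, and nothing tells it that the unsignaled WR
finished.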
It's really tricky to get unsignaled WRs right when errors occur. It
might still be the provider, but there are possibilities on both
sides of the API.
More investigation needed.
Indeed, and good hunting!
Tom.
Doesn't the verbs API contract stipulate that every posted WR gets
exactly one completion? I don't see this behavior with other
providers.
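
The "exactly one completion per posted WR" expectation can be written
as a simple audit. This is a hypothetical diagnostic sketch in Python,
not something that exists in xprtrdma or the verbs API; CompletionAudit
and its method names are made up for illustration.

```python
class CompletionAudit:
    """Track posted WR ids and flag any id that completes more than once."""

    def __init__(self):
        self.outstanding = set()   # posted, not yet completed
        self.duplicates = []       # ids completed while not outstanding

    def posted(self, wr_id):
        self.outstanding.add(wr_id)

    def completed(self, wr_id):
        """Return True for a first completion, False for a duplicate."""
        if wr_id in self.outstanding:
            self.outstanding.remove(wr_id)
            return True
        self.duplicates.append(wr_id)
        return False
```

A handler that consults such an audit before tearing down the MR would
see the second flush as a duplicate instead of unmapping and freeing
twice.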
Thanks for any advice.
--
Chuck Lever