> On Jul 2, 2017, at 5:45 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>>> Or wait for the send completion before completing the I/O?
>>
>> In the normal case, that works.
>>
>> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
>> and recovers all resources. The Send can still be running at that
>> point, and it can't be stopped (without transitioning the QP to
>> error state, I guess).
>
> In that case we can't complete the I/O either (or move the
> QP into error state), we need to defer/sleep on send completion.

Unfortunately the RPC client finite state machine must not sleep when a
POSIX signal fires. xprtrdma has to unblock the waiting application
process but clean up the resources asynchronously.

The RPC completion doesn't have to wait on DMA-unmapping the Send
buffer. What would have to wait is cleaning up the resources -- in
particular, allowing the rpcrdma_req structure, where the Send SGEs
are kept, to be re-used. In the current design, both happen at the
same time.

>> The alternative is reference-counting the data structure that has
>> the ib_cqe and the SGE array. That adds one or more atomic_t
>> operations per I/O that I'd like to avoid.
>
> Why atomics?

Either an atomic reference count or a spin lock is necessary because
there are two different ways an RPC can exit:

1. The common way: receipt of an RPC reply, handled by
   rpcrdma_reply_handler.

2. A POSIX signal, where the RPC reply races with the wake-up of the
   application process (in other words, the reply can still arrive
   while the RPC is terminating).

In both cases, the RPC client has to invalidate any registered memory,
and that has to be done exactly once.

I deal with some of this in my for-4.13 patches:

  http://marc.info/?l=linux-nfs&m=149693711119727&w=2

The first seven patches handle the race condition and the need for
exactly-once invalidation. But the issue with unmapping the Send
buffers has to do with how the Send SGEs are managed.
The data structure containing the SGEs goes away once the RPC is
complete. So there are two "users": one is the RPC completion, and one
is the Send completion. Once both are done, the data structure can be
released. But RPC completion can't wait if the Send completion hasn't
yet fired.

I could kmalloc the SGE array instead, signal each Send, and then in
the Send completion handler, unmap the SGEs and then kfree the SGE
array. That's a lot of overhead.

Or I could revert all the "map page cache pages" logic and just use
memcpy for small NFS WRITEs, and RDMA the rest of the time. That keeps
everything simple, but means large inline thresholds can't use
send-in-place.

I'm still open to suggestion. for-4.14 will deal with other problems,
unless an obvious and easy fix arises.

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html