> On Jul 2, 2017, at 5:45 AM, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
>
>>> Or wait for the send completion before completing the I/O?
>>
>> In the normal case, that works.
>>
>> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
>> and recovers all resources. The Send can still be running at that
>> point, and it can't be stopped (without transitioning the QP to
>> error state, I guess).
>
> In that case we can't complete the I/O either (or move the
> QP into error state), we need to defer/sleep on send completion.

Unfortunately the RPC client finite state machine must not sleep when a
POSIX signal fires. xprtrdma has to unblock the waiting application
process but clean up the resources asynchronously.

The RPC completion doesn't have to wait on DMA-unmapping the Send
buffer. What would have to wait is cleaning up the resources -- in
particular, allowing the rpcrdma_req structure, where the Send SGEs
are kept, to be re-used. In the current design, both happen at the
same time.

>> The alternative is reference-counting the data structure that has
>> the ib_cqe and the SGE array. That adds one or more atomic_t
>> operations per I/O that I'd like to avoid.
>
> Why atomics?

Either an atomic reference count or a spin lock is necessary because
there are two different ways an RPC can exit:

1. The common way: receipt of an RPC reply, handled by
   rpcrdma_reply_handler.

2. A POSIX signal, where the RPC reply races with the wake-up of the
   application process (in other words, the reply can still arrive
   while the RPC is terminating).

In both cases, the RPC client has to invalidate any registered memory,
and that has to be done exactly once.

I deal with some of this in my for-4.13 patches:

  http://marc.info/?l=linux-nfs&m=149693711119727&w=2

The first seven patches handle the race condition and the need for
exactly-once invalidation. But the issue with unmapping the Send
buffers has to do with how the Send SGEs are managed.
The data structure containing the SGEs goes away once the RPC is
complete. So there are two "users": one is the RPC completion, and one
is the Send completion. Once both are done, the data structure can be
released. But RPC completion can't wait if the Send completion hasn't
yet fired.

I could kmalloc the SGE array instead, signal each Send, and then in
the Send completion handler, unmap the SGEs and then kfree the SGE
array. That's a lot of overhead.

Or I could revert all the "map page cache pages" logic and just use
memcpy for small NFS WRITEs, and RDMA the rest of the time. That keeps
everything simple, but means large inline thresholds can't use
send-in-place.

I'm still open to suggestion. for-4.14 will deal with other problems,
unless an obvious and easy fix arises.

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html