On Thu, 2017-02-23 at 09:54 -0500, Chuck Lever wrote:
> > On Feb 22, 2017, at 10:25 PM, Trond Myklebust
> > <trondmy@primarydata.com> wrote:
> > 
> > On Wed, 2017-02-22 at 16:31 -0500, Chuck Lever wrote:
> > > Hey Trond-
> > > 
> > > To support the ability to unload the underlying RDMA device's
> > > kernel driver while NFS mounts are active, xprtrdma needs the
> > > ability to suspend RPC sends temporarily while the transport
> > > hands HW resources back to the driver. Once the device driver
> > > is unloaded, the RDMA transport is left disconnected, and RPCs
> > > will be suspended normally until a connection is possible again
> > > (e.g., a new device is made available).
> > > 
> > > A DEVICE_REMOVAL event is an upcall to xprtrdma that may sleep.
> > > Upon its return, the device driver unloads itself. Currently my
> > > prototype frees all HW resources during the upcall, but that
> > > doesn't block new RPCs from trying to use those resources at
> > > the same time.
> > > 
> > > Seems like the most natural way to temporarily block sends
> > > would be to grab the transport's write lock, just like
> > > "connect" does, while the transport is dealing with
> > > DEVICE_REMOVAL, then release it once all HW resources have been
> > > freed.
> > > 
> > > Unfortunately an RPC task is needed to acquire the write lock.
> > > But disconnect is just an asynchronous event; there is no RPC
> > > task associated with it, and thus no context that the RPC
> > > scheduler can put to sleep if there happens to be another RPC
> > > sending at the moment a device removal event occurs.
> > > 
> > > I was looking at xprt_lock_connect, but that doesn't appear to
> > > do quite what I need.
> > > 
> > > Another thought was to have the DEVICE_REMOVAL upcall mark the
> > > transport disconnected, send an asynchronous NULL RPC, then
> > > wait on a kernel waitqueue.
> > > 
> > > The NULL RPC would grab the write lock and kick the transport's
> > > connect worker. The connect worker would free HW resources,
> > > then awaken the waiter. Then the upcall could return to the
> > > driver.
> > > 
> > > The problem with this scheme is the same as it was for the
> > > keepalive work: there's no task or rpc_clnt available to the
> > > DEVICE_REMOVAL upcall. Sleeping until the write lock is
> > > available would require a task, and sending a NULL RPC would
> > > require an rpc_clnt.
> > > 
> > > Any advice/thoughts about this?
> > 
> > Can you perhaps use XPRT_FORCE_DISCONNECT? That does end up
> > calling the xprt->ops->close() callback as soon as the XPRT_LOCK
> > state has been freed. You still won't have a client, but you will
> > be guaranteed exclusive access to the transport, and you can do
> > things like waking up any sleeping tasks on the transmit and
> > receive queue to help you. However, you also have to deal with
> > the case where the transport was idle to start with.
> 
> The case where the transport is idle is indeed the most bothersome.
> 
> I hadn't realized that ->close guaranteed exclusive access to
> the transport. If ->close guarantees access to the transport, then
> I don't need an RPC task or the NULL RPC, because we don't ever
> have to wait for write access.

Yes, ->close() must always be taken with XPRT_LOCK set, and it is
guaranteed to be called the instant that the current holder of
XPRT_LOCK relinquishes it.
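For reference, xprt_clear_locked() in net/sunrpc/xprt.c currently
looks more or less like this (quoting from memory, so modulo some
version-to-version drift):

	static void xprt_clear_locked(struct rpc_xprt *xprt)
	{
		xprt->snd_task = NULL;
		if (!test_bit(XPRT_CLOSE_WAIT, &xprt->state)) {
			smp_mb__before_atomic();
			clear_bit(XPRT_LOCKED, &xprt->state);
			smp_mb__after_atomic();
		} else
			/* A close was requested while the lock was
			 * held: instead of handing XPRT_LOCK to the
			 * next waiting task, schedule task_cleanup,
			 * i.e. xprt_autoclose(), which invokes
			 * xprt->ops->close(). */
			queue_work(xprtiod_workqueue,
				   &xprt->task_cleanup);
	}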
This is why we have the test for XPRT_CLOSE_WAIT in
xprt_clear_locked(); it short-circuits the wait queues and immediately
starts queue_work(xprtiod_workqueue, &xprt->task_cleanup).

> The hardware resources can be torn down immediately in
> xprt_rdma_close.

Sounds like the right thing to do, then.

> > The big problem that you have here is ultimately that the
> > low-level control channel for the transport appears to want to use
> > the RPC upper-layer functionality for its communication mechanism.
> > AFAICS you will keep hitting issues as the control channel needs
> > to circumvent all the queueing etc. that these upper layers are
> > designed to enforce.
> > Given that these messages you're sending are just NULL pings with
> > no payload and no special authentication needs or anything else,
> > might it make sense to just generate them in the RDMA layer
> > itself?
> 
> (I don't really need a NULL RPC for the DEVICE_REMOVAL case. I
> just need an RPC task.)
> 
> There seems to be a strong community belief that code duplication
> is bad. Constructing even a NULL RPC without using the existing
> infrastructure would duplicate a large amount of code.
> 
> Eventually RPC-over-RDMA should have its own ping mechanism,
> probably. But Version One doesn't.
> 
> An alternative to ping here would be to simply force the
> connection to disconnect if an RPC times out. That can be done
> trivially in the timer callout, for example.

Right. I agree that the ->timer() callback will work just fine for
that.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
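P.S. For concreteness, a minimal sketch of that ->timer() idea: a
callout in net/sunrpc/xprtrdma/transport.c that forces a disconnect
when a request times out. Untested sketch only, not an actual patch,
and modulo any locking considerations in the ->timer() path:

	/*
	 * Invoked when an RPC request times out waiting for a reply.
	 * xprt_force_disconnect() sets XPRT_CLOSE_WAIT, so
	 * xprt->ops->close() runs as soon as XPRT_LOCK is released,
	 * and the normal reconnect/retransmit logic takes over from
	 * there.
	 */
	static void
	xprt_rdma_timer(struct rpc_xprt *xprt, struct rpc_task *task)
	{
		xprt_force_disconnect(xprt);
	}

	static struct rpc_xprt_ops xprt_rdma_procs = {
		/* ... existing callouts elided ... */
		.timer		= xprt_rdma_timer,
	};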