> On Jan 25, 2018, at 10:13 AM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>
> On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
>>>>> Today host and target stacks will respond to RDMA device reset (or
>>>>> plug out and plug in) by cleaning all resources related to that
>>>>> device, and sitting idle waiting for administrator intervention to
>>>>> reconnect (host stack) or rebind the subsystem to a port (target
>>>>> stack).
>>>>>
>>>>> I'm thinking that maybe the right behaviour should be to try and
>>>>> restore everything as soon as the device becomes available again.
>>>>> I don't think a device reset should look different to the users
>>>>> than ports going down and up again.
>>>>
>>>> Hmm, not sure I fully agree here. In my mind device removal means
>>>> the device is going away, which means there is no point in keeping
>>>> the controller around...
>>>
>>> The same could have been said about a port going down. You don't know
>>> if it will come back up connected to the same network...
>>
>> That's true. However, in my mind port events are considered transient,
>> and we do give up at some point. I'm simply arguing that device
>> removal has different semantics. I don't argue that we need to
>> support it.
>
> I think it depends on how you view yourself (meaning the target or
> initiator stacks). It's my understanding that if device eth0
> disappeared completely, and then device eth1 was plugged in, and eth1
> got the same IP address as eth0, then as long as any TCP sockets hadn't
> gone into reset state, the iSCSI devices across the existing connection
> would simply keep working. This is correct, yes?

For NFS/RDMA, I think of the "failover" case where a device is removed,
then a new one is plugged in (or an existing cold replacement is made
available) with the same IP configuration. On a "hard" NFS mount, we
want the upper layers to wait for a suitable new device to be made
available, and then to use it to resend any pending RPCs. The workload
should continue once a new device is available. Feel free to tell me
I'm full of turtles.

> If so, then maybe you want iSER at least to operate the same way. The
> problem, of course, is that iSER may use the IP address and ports for
> connection, but then it transitions to queue pairs for data transfer.
> Because iSER does that, it is sitting at the same level as, say, the
> net core that *did* know about the eth change in the above example and
> transitioned the TCP socket from the old device to the new, meaning
> that iSER now has to take that same responsibility on itself if it
> wishes the user-visible behavior of iSER devices to be the same as
> iSCSI devices. And that would even be true if the old RDMA device went
> away and a new RDMA device came up with the old IP address, so the
> less drastic form of bouncing the existing device should certainly
> fall under the same umbrella.
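To make that point concrete: once a kernel ULP has resolved its IP
address via rdma_cm, the connection is pinned to one specific device.
Below is a rough sketch of the connect path, just to show where the IP
address stops mattering. The rdma_cm calls are the real interface;
struct ulp_queue, ulp_cm_handler() and ULP_CONNECT_TIMEOUT_MS are
made-up names for illustration, not taken from any particular driver.

#include <linux/err.h>
#include <net/net_namespace.h>
#include <rdma/rdma_cm.h>

#define ULP_CONNECT_TIMEOUT_MS	1000		/* illustrative value */

struct ulp_queue {
	struct rdma_cm_id	*cm_id;
	/* ... QP, CQs, per-queue state ... */
};

static int ulp_cm_handler(struct rdma_cm_id *cm_id,
			  struct rdma_cm_event *event);

/*
 * Hypothetical ULP connect path.  The IP address is only consumed here:
 * rdma_resolve_addr() picks a local RDMA device and binds the cm_id to
 * it.  All later I/O runs over a QP on that one device, so nothing
 * below the ULP can silently migrate the connection the way the net
 * core migrates a TCP socket.
 */
static int ulp_connect_queue(struct ulp_queue *q, struct sockaddr *dst)
{
	int ret;

	q->cm_id = rdma_create_id(&init_net, ulp_cm_handler, q,
				  RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(q->cm_id))
		return PTR_ERR(q->cm_id);

	/* Resolve dst to a local device/GID; this is where the binding
	 * to a specific ib_device happens. */
	ret = rdma_resolve_addr(q->cm_id, NULL, dst,
				ULP_CONNECT_TIMEOUT_MS);
	if (ret) {
		rdma_destroy_id(q->cm_id);
		return ret;
	}

	/* ADDR_RESOLVED, ROUTE_RESOLVED and ESTABLISHED arrive
	 * asynchronously in ulp_cm_handler(), which calls
	 * rdma_resolve_route() and rdma_connect() in turn. */
	return 0;
}

Contrast that with a TCP transport, where the socket keeps working
across the eth0/eth1 swap above without the ULP lifting a finger.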
> I *think* for SRP this is already the case. The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt
> layer, doesn't the config get preserved? So that when the device came
> back up, the LIO configuration would still be there and the SRPt
> driver would see that? Bart?
>
> For the SRP client, I'm almost certain it will try to reconnect since
> it uses a user space daemon with a shell script that restarts the
> daemon on various events. That might have changed... didn't we just
> take a patch to rdma-core to drop the shell script? It might not
> reconnect automatically with the latest rdma-core, I'd have to check.
> Bart should know though...
>
> I haven't the faintest clue on NVMe over fabrics though. But, again, I
> think that's up to you guys to decide what semantics you want. With
> iSER it's a little easier since you can use the TCP semantics as a
> guideline and you have IP/port discovery, so it doesn't even have to
> be the same controller that comes back. With SRP it must be the same
> controller that comes back or else your login information will be all
> wrong (well, we did just take RDMA_CM support patches for SRP that
> allow IP/port addressing instead, so theoretically it could now do the
> same thing if you are using RDMA_CM mode logins). I don't know the
> details of the NVMe addressing though.
>
>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> clean up resources, and restore when the reset sequence completes
>>>> (which is what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the
> NVMe stack is implemented (which was intentional for speed, as I
> understand it). As a differing example, in the SCSI stack when the LLD
> does a SCSI host reset, it resets the host but does not restore or
> restart any commands that were aborted. It is up to the upper layer
> SCSI drivers to do so (if they choose, they might send the commands
> back to the block layer). From the way you wrote the above, it sounds
> like the NVMe layer is almost monolithic in nature, with no separation
> between an upper level consumer layer and a lower level driver layer,
> so you can reset/restart everything internally. I would argue that's
> rare in the Linux kernel; in most places the low level driver resets,
> and some other upper layer has to restart things if it wants to, or
> error out if it doesn't.
>
>>>> I'm not sure I understand why RDMA device resets manifest as
>>>> DEVICE_REMOVAL events to ULPs (via rdma_cm or .remove_one via the
>>>> ib_client API). I think the correct interface would be
>>>> suspend/resume semantics for RDMA device resets (similar to the pm
>>>> interface).
>
> No, we can't do this. Suspend/resume is not the right model for an
> RDMA device reset. An RDMA device reset is a hard action that stops
> all ongoing DMA regardless of its source. Those sources include kernel
> layer consumers, user space consumers acting without the kernel's
> direct intervention, and ongoing DMA with remote RDMA peers (which
> will throw the remote queue pairs into an error state almost
> immediately). In the future it very likely could include RDMA between
> things like GPU offload processors too. We can't restart that stuff
> even if we wanted to. So suspend/resume semantics for an RDMA device
> level reset is a non-starter.
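For reference, the notification a kernel ULP actually gets today looks
roughly like the sketch below (4.15-era ib_client; the callback
signatures have changed in later kernels). A reset is delivered as
remove followed, some time later, by add of the "new" device; there is
nothing in between that a suspend/resume model could hook. The ulp_
prefixed names are invented for illustration.

#include <rdma/ib_verbs.h>

/*
 * Roughly how a kernel ULP watches RDMA devices.  A device reset is
 * seen as remove_one() followed later by add_one() of the fresh
 * device -- there is no suspend/resume style notification in between.
 */
static void ulp_add_one(struct ib_device *device)
{
	/*
	 * Allocate per-device state.  A ULP that wanted reset survival
	 * could also kick any reconnect work that was parked waiting
	 * for a device with a matching address to reappear.
	 */
}

static void ulp_remove_one(struct ib_device *device, void *client_data)
{
	/*
	 * Every QP, CQ, PD and MR on this device must be released
	 * before returning -- the device is going away, full stop.
	 */
}

static struct ib_client ulp_ib_client = {
	.name   = "ulp_example",
	.add    = ulp_add_one,
	.remove = ulp_remove_one,
};

/* Registered once at module init: ib_register_client(&ulp_ib_client); */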
>>>> I think that it would make much cleaner semantics, and ULPs should
>>>> be able to understand exactly what to do (which is what you
>>>> suggested above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what the "standard" is here for Linux in
>>> general and networking devices in particular. Let's see if
>>> linux-rdma agrees here.
>>
>> I would like to hear more opinions on the current interface.
>
> There is a difference between RDMA devices and other network devices.
> The net stack is much more like the SCSI stack in that you have an
> upper layer connection (socket or otherwise), a lower layer transport,
> and the net core code, which is free to move your upper layer
> abstraction from one lower layer transport to another. With the RDMA
> subsystem, your upper layer is connecting directly into the low level
> hardware. If you want a semantic that includes reconnection on an
> event, then it has to be handled in your upper layer, as there is no
> intervening middle layer to abstract out the task of moving your
> connection from one low level device to another (that's not to say we
> couldn't create one, and several actually already exist, like SMC-R
> and RDS, but direct hooks into the core ib stack are not abstracted
> out and you are talking directly to the hardware). And if you want to
> support moving your connection from an old removed device to a new
> replacement device that is not simply the same physical device being
> plugged back in, then you need an addressing scheme that doesn't rely
> on the link layer hardware address of the device.
>
>>>> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
>>>> reconnects, because we have DMA mappings we need to unmap for each
>>>> request in the tagset, which we don't tear down on every reconnect
>>>> (as we may have inflight I/O). We could theoretically have used
>>>> reinit_tagset to do that, though.
>>>
>>> Obviously it isn't that simple... Just trying to agree on the right
>>> direction to go.
>>
>> Yea, I agree. It shouldn't be too hard, either.
>>
>>>>> In the reconnect flow the stack already repeats creating the cm_id
>>>>> and resolving address and route, so when the RDMA device comes
>>>>> back up, and assuming it will be configured with the same address
>>>>> and connected to the same network (as is the case in a device
>>>>> reset), connections will be restored automatically.
>>>>
>>>> As I said, I think the problem is the interface of RDMA device
>>>> resets. IMO, device removal means we need to delete all the nvme
>>>> controllers associated with the device.
>>>
>>> Do you think all associated controllers should be deleted when a TCP
>>> socket gets disconnected in NVMe-over-TCP? Do they?
>>
>> Nope, but that is equivalent to a QP going into error state IMO, and
>> we don't do that in nvme-rdma either.
>
> There is no equivalent in the TCP realm of an RDMA controller reset or
> an RDMA controller permanent removal event. When dealing with TCP, if
> the underlying ethernet device is reset, you *might* get a TCP socket
> reset, you might not. If the underlying ethernet is removed, you might
> get a socket reset, you might not, depending on how the route to the
> remote host is re-established. If all IP capable devices in the entire
> system are removed, your TCP socket will get a reset, and attempts to
> reconnect will get an error.
>
> None of those sound semantically comparable to RDMA device
> unplug/replug. Again, that's just because the net core never
> percolates that up to the TCP layer.
>
> When you have a driver that has both TCP and RDMA transports, the
> truth is you are plugging into two very different levels of the
> kernel, and the work you have to do to support one is very different
> from the other. I don't think it's worthwhile even to talk about
> trying to treat them equivalently unless you want to take on an
> addressing scheme and reset/restart capability on the RDMA side of
> things that you don't have to have on the TCP side of things.
>
> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events. But the RDMA stack, as
> it stands, can't reconnect things for you; you would have to do that
> in your own code.
>
> --
> Doug Ledford <dledford@xxxxxxxxxx>
>     GPG KeyID: B826A3330E572FDD
>     Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
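"Doing that in your own code" would look something like the event
handler sketched below. The event names are the real rdma_cm ones;
struct ulp_queue, ulp_schedule_reconnect() and ulp_teardown_queue() are
invented for illustration, and the DEVICE_REMOVAL arm just shows the
two policies being debated in this thread: today's give-up behavior
versus treating removal like a long port-down.

#include <rdma/rdma_cm.h>

struct ulp_queue;					/* hypothetical per-queue state */
static void ulp_schedule_reconnect(struct ulp_queue *q);  /* redo addr/route/connect */
static void ulp_teardown_queue(struct ulp_queue *q);	  /* free QP, CQs, MRs, cm_id */

static int ulp_cm_handler(struct rdma_cm_id *cm_id,
			  struct rdma_cm_event *event)
{
	struct ulp_queue *q = cm_id->context;

	switch (event->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
		/* Treated as transient today: the periodic reconnect
		 * work re-resolves address and route and reconnects. */
		ulp_schedule_reconnect(q);
		break;
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/*
		 * Today's policy: tear everything down and wait for the
		 * administrator.  The alternative argued for in this
		 * thread: also schedule a reconnect here, so the retry
		 * loop picks the device back up when it reappears with
		 * the same address.
		 */
		ulp_teardown_queue(q);
		break;
	default:
		break;
	}
	return 0;
}

That policy decision has to live in each ULP today, exactly because
there is no middle layer that will make it for us.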
--
Chuck Lever