On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
> > > > Today host and target stacks will respond to RDMA device reset (or
> > > > plug out and plug in) by cleaning all resources related to that
> > > > device, and sitting idle waiting for administrator intervention to
> > > > reconnect (host stack) or rebind subsystem to a port (target stack).
> > > > 
> > > > I'm thinking that maybe the right behaviour should be to try and
> > > > restore everything as soon as the device becomes available again. I
> > > > don't think a device reset should look different to the users than
> > > > ports going down and up again.
> > > 
> > > Hmm, not sure I fully agree here. In my mind device removal means the
> > > device is going away which means there is no point in keeping the
> > > controller around...
> > 
> > The same could have been said on a port going down. You don't know if
> > it will come back up connected to the same network...
> 
> That's true. However in my mind port events are considered transient,
> and we do give up at some point. I'm simply arguing that device removal
> has different semantics. I don't argue that we need to support it.

I think it depends on how you view yourself (meaning the target or
initiator stacks). It's my understanding that if device eth0 disappeared
completely, and then device eth1 was plugged in and got the same IP
address eth0 had, then as long as no TCP sockets had gone into reset
state, the iSCSI devices across the existing connection would simply keep
working. Is that correct?

If so, then maybe you want iSER, at least, to operate the same way. The
problem, of course, is that iSER may use the IP address and port for
connection establishment, but it then transitions to queue pairs for data
transfer. Because of that, iSER sits at the same level as the net core
that *did* know about the eth change in the example above and moved the
TCP socket from the old device to the new one, which means iSER has to
take that same responsibility on itself if it wants the user visible
behavior of iSER devices to match that of iSCSI devices (roughly what the
sketch further down shows). And that would be true even if the old RDMA
device went away entirely and a new RDMA device came up with the old IP
address, so the less drastic case of bouncing the existing device should
certainly fall under the same umbrella.

I *think* for SRP this is already the case. The SRP target uses the
kernel LIO framework, so if you bounce the device under the SRPt layer,
doesn't the config get preserved? So that when the device comes back up,
the LIO configuration would still be there and the SRPt driver would see
it? Bart?

For the SRP client, I'm almost certain it will try to reconnect, since it
uses a user space daemon with a shell script that restarts the daemon on
various events. That might have changed, though...didn't we just take a
patch to rdma-core to drop the shell script? It might not reconnect
automatically with the latest rdma-core; I'd have to check. Bart should
know.

I haven't the faintest clue about NVMe over fabrics, though. But, again,
I think that's up to you guys to decide what semantics you want. With
iSER it's a little easier since you can use the TCP semantics as a
guideline, and you have IP/port based discovery, so it doesn't even have
to be the same controller that comes back.
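To put some meat on "take that same responsibility on itself": in rdma_cm
terms I mean something like the sketch below. This is untested, and the
my_ulp_* names and the 10 second delay are invented for illustration;
only the rdma_cm types, calls, and event names are real.

#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <rdma/rdma_cm.h>

/* invented per-connection state, just for the example */
struct my_ulp_conn {
	struct rdma_cm_id	*cm_id;
	struct delayed_work	reconnect_work;
};

static struct workqueue_struct *my_ulp_wq;

static int my_ulp_cm_handler(struct rdma_cm_id *cm_id,
			     struct rdma_cm_event *ev)
{
	struct my_ulp_conn *conn = cm_id->context;

	switch (ev->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/*
		 * Treat device removal like any other transport error:
		 * quiesce I/O and schedule reconnect work that destroys
		 * the old QP and cm_id, then re-runs rdma_resolve_addr()
		 * and rdma_resolve_route() from scratch, so it doesn't
		 * matter whether the address comes back on the same
		 * device or on a different one.  (Since we return 0 here,
		 * that work must destroy the old cm_id before the device
		 * removal can complete.)
		 */
		queue_delayed_work(my_ulp_wq, &conn->reconnect_work, 10 * HZ);
		break;
	default:
		break;
	}
	return 0;
}

That's basically what the TCP-like behavior costs you on the RDMA side:
the reconnect loop is yours to write, the core won't do it for you.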
With SRP it must be the same controller that comes back, or else your
login information will be all wrong (well, we did just take RDMA_CM
support patches for SRP that allow IP/port addressing instead, so
theoretically it could now do the same thing if you are using RDMA_CM
mode logins). I don't know the details of NVMe addressing, though.

> > > AFAIK device resets usually are expected to quiesce inflight I/O,
> > > cleanup resources and restore when the reset sequence completes (which
> > > is what we do in nvme controller resets).

I think your perspective here might be a bit skewed by the way the NVMe
stack is implemented (which was intentional, for speed, as I understand
it). As a differing example, in the SCSI stack, when the LLD does a SCSI
host reset, it resets the host but does not restore or restart any of the
commands that were aborted. It is up to the upper layer SCSI drivers to
do so if they choose (or they might send the commands back to the block
layer). From the way you wrote the above, it sounds like the NVMe layer
is almost monolithic in nature, with no separation between an upper level
consumer layer and a lower level driver layer, so you can reset/restart
everything internally. I would argue that's rare in the Linux kernel: in
most places the low level driver resets, and some other upper layer has
to restart things if it wants to, or error out if it doesn't.

> > > I'm not sure I understand why
> > > RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
> > > rdma_cm or .remove_one via ib_client API). I think the correct interface
> > > would be suspend/resume semantics for RDMA device resets (similar to pm
> > > interface).

No, we can't do this. Suspend/resume is not the right model for an RDMA
device reset. An RDMA device reset is a hard action that stops all
ongoing DMA regardless of its source. Those sources include kernel layer
consumers, user space consumers acting without the kernel's direct
intervention, and ongoing DMA with remote RDMA peers (whose queue pairs
will be thrown into an error state almost immediately). In the future it
very likely could include RDMA between things like GPU offload processors
too. We can't restart that stuff even if we wanted to, so suspend/resume
semantics for an RDMA device level reset are a non-starter.

> > > I think that it would make a much cleaner semantics and ULPs should be
> > > able to understand exactly what to do (which is what you suggested
> > > above).
> > > 
> > > CCing linux-rdma.
> > 
> > Maybe so. I don't know what's the "standard" here for Linux in general
> > and networking devices in particular. Let's see if linux-rdma agree
> > here.
> 
> I would like to hear more opinions on the current interface.

There is a difference between RDMA devices and other network devices. The
net stack is much more like the SCSI stack in that you have an upper
layer connection (socket or otherwise), a lower layer transport, and the
net core code, which is free to move your upper layer abstraction from
one lower layer transport to another. With the RDMA subsystem, your upper
layer is connecting directly into the low level hardware (see the sketch
below for what that looks like from a kernel ULP's point of view).
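And just to spell out the other path you mentioned (.remove_one via the
ib_client API), what that direct connection looks like for a kernel ULP
is roughly the following. Again, the my_ulp_* names and helpers are
invented; the ib_client interface is the real one as I read it, but
double check the signatures against your tree.

#include <rdma/ib_verbs.h>

static struct ib_client my_ulp_client;

/* invented helpers that allocate/free the ULP's per-device state */
void *my_ulp_create_device_state(struct ib_device *device);
void my_ulp_destroy_device_state(void *state);

static void my_ulp_add_one(struct ib_device *device)
{
	/* set up per-device state: PD, CQs, whatever the ULP needs */
	ib_set_client_data(device, &my_ulp_client,
			   my_ulp_create_device_state(device));
}

static void my_ulp_remove_one(struct ib_device *device, void *client_data)
{
	/*
	 * Called for unplug, driver unload, and anything else that
	 * unregisters the ib_device.  Everything tied to this device
	 * (QPs, CQs, MRs, PD) must be gone by the time we return, and
	 * there is no "resume" counterpart waiting on the other side.
	 */
	my_ulp_destroy_device_state(client_data);
}

static struct ib_client my_ulp_client = {
	.name   = "my_ulp",
	.add    = my_ulp_add_one,
	.remove = my_ulp_remove_one,
};

The client gets registered once with ib_register_client(&my_ulp_client),
and from then on .add()/.remove() are the only device lifetime
notifications the ULP sees; as I understand it, a reset that unregisters
and re-registers the device shows up as a remove followed by a fresh add,
which is exactly the interface question being argued here.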
If you want a semantic that includes reconnection on an event, then it
has to be handled in your upper layer, as there is no intervening middle
layer to abstract out the task of moving your connection from one low
level device to another (that's not to say we couldn't create one, and
several actually already exist, like SMC-R and RDS, but direct hooks into
the core IB stack are not abstracted out and you are talking directly to
the hardware). And if you want to support moving your connection from an
old, removed device to a new replacement device that is not simply the
same physical device being plugged back in, then you need an addressing
scheme that doesn't rely on the link layer hardware address of the
device.

> > > Regardless of ib_client vs. rdma_cm, we can't simply perform normal
> > > reconnects because we have dma mappings we need to unmap for each
> > > request in the tagset which we don't teardown in every reconnect (as
> > > we may have inflight I/O). We could have theoretically use reinit_tagset
> > > to do that though.
> > 
> > Obviously it isn't that simple... Just trying to agree on the right
> > direction to go.
> 
> Yea, I agree. It shouldn't be too hard also.
> 
> > > > In the reconnect flow the stack already repeats creating the cm_id
> > > > and resolving address and route, so when the RDMA device comes back
> > > > up, and assuming it will be configured with the same address and
> > > > connected to the same network (as is the case in device reset),
> > > > connections will be restored automatically.
> > > 
> > > As I said, I think that the problem is the interface of RDMA device
> > > resets. IMO, device removal means we need to delete all the nvme
> > > controllers associated with the device.
> > 
> > Do you think all associated controllers should be deleted when a TCP
> > socket gets disconnected in NVMe-over-TCP? Do they?
> 
> Nope, but that is equivalent to QP going into error state IMO, and we
> don't do that in nvme-rdma as well.

There is no equivalent in the TCP realm of an RDMA device reset or a
permanent RDMA device removal event. When dealing with TCP, if the
underlying ethernet device is reset, you *might* get a TCP socket reset,
or you might not. If the underlying ethernet device is removed, you might
get a socket reset, or you might not, depending on how the route to the
remote host is re-established. If all IP capable devices in the entire
system are removed, your TCP socket will get a reset, and attempts to
reconnect will get an error. None of those sound semantically comparable
to an RDMA device unplug/replug. Again, that's just because the net core
never percolates those events up to the TCP layer.

When you have a driver that has both TCP and RDMA transports, the truth
is you are plugging into two very different levels of the kernel, and the
work you have to do to support one is very different from the other. I
don't think it's worthwhile to even talk about trying to treat them
equivalently unless you want to take on an addressing scheme and
reset/restart capability on the RDMA side of things that you don't have
to have on the TCP side.

As a user of things like iSER/SRP/NVMe, I would personally like
connections to persist across non-fatal events. But the RDMA stack, as it
stands, can't reconnect things for you; you would have to do that in your
own code.

-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD