Delayed FC rport removal causes errors on other device

Tony Battersby <tonyb@xxxxxxxxxxxxxxx> · Wed, 17 Dec 2014 17:41:20 -0500

Initiator-mode problem summary:

FC initiator HBA cabled directly (no FC switch) to FC target device #1. 
Unplug cable from FC target device #1 and quickly plug it into FC target
device #2, keeping the other end of the cable plugged into the same
initiator HBA port.  The new device shows up immediately; begin sending
commands to it.  The old device stays visible for 30 to 60 seconds after
the cable was moved, and then it disappears with the message
"rport-7:0-0: blocked FC remote port time out: removing rport".  When
the old device disappears, commands outstanding to the new device are
aborted or lock up.
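
For anyone reproducing this, the rport name in that message can be mapped
to its sysfs directory to watch the old rport while it lingers.  The
following is a minimal sketch, assuming the usual "rport-H:C-N" naming
(host number, channel, rport number) and assuming the 30 to 60 second
delay is governed by the FC transport's dev_loss_tmo attribute.  The
helper names are hypothetical and not part of any driver.

```c
#include <stdio.h>

/* Split "rport-H:C-N" into host number, channel and rport number.
 * Returns 0 on success, -1 if the name does not match that shape. */
int sketch_parse_rport(const char *name, int *host, int *channel, int *number)
{
    char trailing;

    /* Exactly three conversions must succeed; a fourth means trailing junk. */
    if (sscanf(name, "rport-%d:%d-%d%c", host, channel, number, &trailing) != 3)
        return -1;
    return 0;
}

/* Build the sysfs path of one attribute of the named rport.
 * Returns 0 on success, -1 on a malformed name or a too-small buffer. */
int sketch_rport_attr_path(char *buf, size_t len, const char *name,
                           const char *attr)
{
    int h, c, n, written;

    if (sketch_parse_rport(name, &h, &c, &n) != 0)
        return -1;
    written = snprintf(buf, len, "/sys/class/fc_remote_ports/rport-%d:%d-%d/%s",
                       h, c, n, attr);
    return (written < 0 || (size_t)written >= len) ? -1 : 0;
}
```

Calling sketch_rport_attr_path() with "rport-7:0-0" and "dev_loss_tmo"
gives the file to read for the timeout on the old rport.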

Initiator-mode problem details:

vanilla kernel version 3.18.1

I have three types of FC HBAs:
QLogic QLE2672 16Gb FC HBA using qla2xxx
QLogic QLE2562  8Gb FC HBA using qla2xxx
LSI    7204EP   4Gb FC HBA using mptfc

I have seen the problem with both qla2xxx and mptfc.  With qla2xxx,
commands active during the old rport removal are aborted with host
status 0x0E, but then it recovers and additional commands work fine. 
With mptfc, active commands lock up.

---

The problem I have described above happens using just the initiator-mode
drivers in the mainline kernel, but my real interest is a related
problem that happens with the QLogic target-mode drivers from
"git://git.qlogic.com/scst-qla2xxx.git".  I will describe that problem
with more details below.

---

Target-mode problem summary:

Two separate PCs, each with a QLogic QLE2672 16Gb FC HBA (firmware 7.04.01).
One PC uses the FC HBA in initiator mode; the other PC uses the FC HBA
in target mode.
The target-mode PC presents a disk drive to the initiator-mode PC over FC.
The FC HBAs are directly connected with an FC cable; no FC switch involved.

With the cable plugged in, the initiator PC sees the target-mode FC disk
at /dev/sg2.  When I unplug the cable from one port on the initiator and
quickly plug it into the other port on the initiator, a new disk shows
up for the new path at /dev/sg3.  The old disk at /dev/sg2 stays around
for about 30 seconds and then disappears.  That is all fine and
expected.  The problem is that when the old disk at /dev/sg2 disappears,
the new disk at /dev/sg3 stops responding to commands (but doesn't
disappear).

Note: in the initiator-mode test described at the beginning of this
message, the cable is moved from one target device to another target
device while keeping the same initiator port.  In contrast, in this
target-mode test, the cable is moved from one initiator port to another
initiator port while keeping the same target port.  That is what
distinguishes the two tests.  Whichever port remains the same (whether
initiator or target) is the port that causes the problem when the old
rport is removed.

Target-mode problem details:

vanilla kernel version 3.18.1

Before unplugging the cable, the target-mode PC creates a session with
portid 00:00:e8, loop_id 0, and the wwn of the first initiator port. 
When the cable is unplugged, the session is scheduled for deletion. 
When the cable is plugged into the other initiator port, the target-mode
PC creates another session with the same portid and loop_id as the first
session (which is now scheduled for deletion) but with a different wwn
corresponding to the second initiator port.

The new disk at /dev/sg3 stops responding to commands when the
target-mode PC calls isp_ops->fabric_logout() from
qla2x00_terminate_rport_io() in qla_attr.c.  If I disable that call to
fabric_logout() then the new disk at /dev/sg3 continues to work as
expected.  It looks like qla2x00_terminate_rport_io() is being called to
clean up the old removed fcport, but ends up messing up the new
still-present fcport instead (I am guessing because the old and new
fcports share the same portid and loop_id).  After fabric_logout() messes
up the new fcport, the target-mode HBA returns CTIO_PORT_LOGGED_OUT for
any new incoming commands from the initiator.
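
To make the suspected collision concrete, here is a minimal standalone
sketch of the kind of check that could run before the logout.  It is not
the qla2xxx code: struct sketch_fcport and sketch_safe_to_logout() are
hypothetical simplifications, and treating a match on either identifier
as a conflict is my assumption about how the logout is addressed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's per-port state. */
struct sketch_fcport {
    uint32_t port_id;   /* 24-bit FC port ID, e.g. 0x0000e8 */
    uint16_t loop_id;   /* firmware loop ID */
    uint64_t wwpn;      /* world wide port name */
    bool     online;    /* true while the port is still present */
};

/* Return true only if no other online port shares the old port's
 * loop_id or port_id, i.e. a logout addressed by those identifiers
 * could not hit a port that is still in use. */
bool sketch_safe_to_logout(const struct sketch_fcport *old,
                           const struct sketch_fcport *ports, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct sketch_fcport *p = &ports[i];

        if (p == old || !p->online)
            continue;   /* skip the port being removed and dead ports */
        if (p->loop_id == old->loop_id || p->port_id == old->port_id)
            return false;   /* address now belongs to a live port */
    }
    return true;
}
```

In the scenario above, the old fcport and the new fcport both carry
portid 00:00:e8 and loop_id 0, so this check would refuse the logout for
the old fcport while the new one is online.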

The problem can be avoided by waiting 30 seconds after unplugging the
cable before plugging it back in.  But that is not a good solution for
me since these HBAs are to be used in a product sold by my company, and
we want it to "just work" for our customers.

I am very familiar with SCSI but much less so with FC, so I am not
exactly sure of the correct fix.  Please bear with me while I ask a few
questions: Is it correct for the old and new fcports to share the same
portid and loop_id?  When creating the new fcport, should qla2xxx have
detected that the lost-but-not-yet-dead fcport was using the same portid
and loop_id, and chosen to use different values for the new fcport
instead?  Or should it have invalidated the portid and/or loop_id of the
lost-but-not-yet-dead fcport somewhere (LOOP UP/LOOP DOWN/LIP
reset/etc.)?  Or is it OK for them to share the same portid/loop_id
values, but instead qla2x00_terminate_rport_io() needs more checks
before calling fabric_logout()?
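
The "invalidate the lost fcport" alternative could look roughly like the
following standalone sketch.  Everything in it is hypothetical: the
struct, the sentinel values, and the idea of running it when a new
session logs in are assumptions for illustration, not existing qla2xxx
behavior.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LOOP_ID_NONE 0xFFFFu      /* hypothetical "no loop_id" marker */
#define SKETCH_PORT_ID_NONE 0xFFFFFFFFu  /* hypothetical "no portid" marker */

/* Hypothetical, simplified session record. */
struct sketch_session {
    uint32_t port_id;
    uint16_t loop_id;
    uint64_t wwpn;
    bool     lost;      /* link dropped, deletion still pending */
};

/* When 'incoming' logs in, strip the address from every lost session
 * with a different WWPN that still holds the same loop_id or port_id.
 * Returns the number of sessions invalidated. */
size_t sketch_invalidate_stale(struct sketch_session *tbl, size_t n,
                               const struct sketch_session *incoming)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++) {
        struct sketch_session *s = &tbl[i];

        if (s == incoming || !s->lost)
            continue;   /* only lost-but-not-yet-dead sessions qualify */
        if (s->wwpn == incoming->wwpn)
            continue;   /* same port coming back, not a stale alias */
        if (s->loop_id == incoming->loop_id ||
            s->port_id == incoming->port_id) {
            s->loop_id = SKETCH_LOOP_ID_NONE;
            s->port_id = SKETCH_PORT_ID_NONE;
            hits++;
        }
    }
    return hits;
}

/* A session without a valid address must not be logged out by address. */
bool sketch_has_address(const struct sketch_session *s)
{
    return s->loop_id != SKETCH_LOOP_ID_NONE &&
           s->port_id != SKETCH_PORT_ID_NONE;
}
```

With this approach the later cleanup of the old session would see no
valid address and could skip the logout, leaving the new session alone.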

I would be happy to test any patches that anyone can provide.  Or if
someone can provide answers to my questions above or other guidance,
then I can try to come up with a fix myself.

Thanks,
Tony Battersby
Cybernetics
