On 4/9/2018 1:03 AM, Hannes Reinecke wrote:
On Sat, 7 Apr 2018 11:30:24 -0700
James Smart <jsmart2021@xxxxxxxxx> wrote:
Driver unload isn't waiting for all outstanding nvme associations
to terminate before clearing structures. In particular, it does not
set dev_loss_tmo to 0 so that all associations are terminated
immediately. Thus the transport would enter reconnect timeouts and
reattempt reconnects to an nvme controller. The reconnect attempt calls
into the driver to create hw queues for the controller, which causes
a NULL pointer dereference.
Correct by changing the teardown process to set all dev_loss_tmo
timeouts to 0 so that they are immediate. Now, when the teardown process
initiates, the remote ports are unregistered and the delete callback is
made, and as the associations are terminated immediately upon remoteport
unregister, the transport will no longer invoke the callbacks for a new
controller.
Signed-off-by: Dick Kennedy <dick.kennedy@xxxxxxxxxxxx>
Signed-off-by: James Smart <james.smart@xxxxxxxxxxxx>
---
drivers/scsi/lpfc/lpfc_hbadisc.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
Hmm. This seems to be a very circumspect way of deleting all
outstanding I/O...
Is there any guarantee that nvme_fc_set_remoteport_devloss() will
return only after all callbacks are invoked?
Well, roundabout, I agree. No, set_remoteport_devloss won't make that
guarantee, but the unregister_remoteport and the wait for the
remoteport_delete callback will.
And as I look deeper at this failure scenario, I'm starting to believe
that the actual problem was the missed unregister_remoteport, which was
corrected by another patch in this set (patch 12 or 14).
I'm going to repost, pulling this patch from the set. We'll retest and,
if it's still needed, fix it in the next patch set.
-- james