[PATCH 04/14] lpfc: Fix hang on unload due to stuck fport node

James Smart <jsmart2021@xxxxxxxxx> · Fri, 10 Sep 2021 16:31:49 -0700

A test scenario encountered an unload hang while an FLOGI ELS was in
flight when a link down condition occurred.  The driver fails unload
as it never releases the fport node.

For most nodes, when the link drops, devloss tmo is started and the
timeout will cause the final node release. For the Fport, as it has
not yet registered with the SCSI transport, there is no devloss timer
to be started, so there is no final release.  Additionally, the link
down sequence causes ABORTS to be issued for pending ELS's. The
completions from the ABORTS perform the release of node references.
However, as the adapter is being reset to be unloaded, those
completions will never occur.

Fix by the following:
- In the els cleanup, recognize when unloading and place the els's
  on a different list that immediately cleansup/completes the els's.
  It's recognized that this condition primarily affects only the
  fport, with other ports having normal clean up logic that handles
  things.
- Resolve the devloss issue by, when cleaning up nodes on after link
  down, recognizing when the fabric node does not have a completed
  state (its state is UNUSED) and removing a reference so the node
  can delete after the els reference is released.

Co-developed-by: Justin Tee <justin.tee@xxxxxxxxxxxx>
Signed-off-by: Justin Tee <justin.tee@xxxxxxxxxxxx>
Signed-off-by: James Smart <jsmart2021@xxxxxxxxx>
---
 drivers/scsi/lpfc/lpfc_els.c     | 14 ++++++++++++++
 drivers/scsi/lpfc/lpfc_hbadisc.c | 14 +++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 262101e172ad..6c9cb87ef174 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -11386,6 +11386,7 @@ lpfc_sli4_vport_delete_els_xri_aborted(struct lpfc_vport *vport)
 {
 	struct lpfc_hba *phba = vport->phba;
 	struct lpfc_sglq *sglq_entry = NULL, *sglq_next = NULL;
+	struct lpfc_nodelist *ndlp = NULL;
 	unsigned long iflag = 0;
 
 	spin_lock_irqsave(&phba->sli4_hba.sgl_list_lock, iflag);
@@ -11393,7 +11394,20 @@ lpfc_sli4_vport_delete_els_xri_aborted(struct lpfc_vport *vport)
 			&phba->sli4_hba.lpfc_abts_els_sgl_list, list) {
 		if (sglq_entry->ndlp && sglq_entry->ndlp->vport == vport) {
 			lpfc_nlp_put(sglq_entry->ndlp);
+			ndlp = sglq_entry->ndlp;
 			sglq_entry->ndlp = NULL;
+
+			/* If the xri on the abts_els_sgl list is for the Fport
+			 * node and the vport is unloading, the xri aborted wcqe
+			 * likely isn't coming back.  Just release the sgl.
+			 */
+			if ((vport->load_flag & FC_UNLOADING) &&
+			    ndlp->nlp_DID == Fabric_DID) {
+				list_del(&sglq_entry->list);
+				sglq_entry->state = SGL_FREED;
+				list_add_tail(&sglq_entry->list,
+					&phba->sli4_hba.lpfc_els_sgl_list);
+			}
 		}
 	}
 	spin_unlock_irqrestore(&phba->sli4_hba.sgl_list_lock, iflag);
diff --git a/drivers/scsi/lpfc/lpfc_hbadisc.c b/drivers/scsi/lpfc/lpfc_hbadisc.c
index 6f2e07c30f98..4ff93aef3295 100644
--- a/drivers/scsi/lpfc/lpfc_hbadisc.c
+++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
@@ -966,8 +966,20 @@ lpfc_cleanup_rpis(struct lpfc_vport *vport, int remove)
 	struct lpfc_nodelist *ndlp, *next_ndlp;
 
 	list_for_each_entry_safe(ndlp, next_ndlp, &vport->fc_nodes, nlp_listp) {
-		if (ndlp->nlp_state == NLP_STE_UNUSED_NODE)
+		if (ndlp->nlp_state == NLP_STE_UNUSED_NODE) {
+			/* It's possible the FLOGI to the fabric node never
+			 * successfully completed and never registered with the
+			 * transport.  In this case there is no way to clean up
+			 * the node.
+			 */
+			if (ndlp->nlp_DID == Fabric_DID) {
+				if (ndlp->nlp_prev_state ==
+				    NLP_STE_UNUSED_NODE &&
+				    !ndlp->fc4_xpt_flags)
+					lpfc_nlp_put(ndlp);
+			}
 			continue;
+		}
 
 		if ((phba->sli3_options & LPFC_SLI3_VPORT_TEARDOWN) ||
 		    ((vport->port_type == LPFC_NPIV_PORT) &&
-- 
2.26.2