Re: [PATCH 03/10] scsi: Use scsi_target as argument for eh_target_reset_handler()

Hannes Reinecke <hare@xxxxxxx> · Wed, 25 Oct 2023 17:35:53 +0200

On 10/25/23 17:11, Benjamin Block wrote:
On Mon, Oct 23, 2023 at 11:28:30AM +0200, Hannes Reinecke wrote:

diff --git a/Documentation/scsi/scsi_eh.rst b/Documentation/scsi/scsi_eh.rst
index 553aff3062d5..cfaf841317ed 100644
--- a/Documentation/scsi/scsi_eh.rst
+++ b/Documentation/scsi/scsi_eh.rst
@@ -411,6 +412,14 @@ scmd->allowed.
  	    resetting clears all scmds on the sdev, there is no need
  	    to choose error-completed scmds.
  
+	2. If !list_empty(&eh_work_q), invoke scsi_eh_target_reset().
+
+	``scsi_eh_target_reset``
+
+	    hostt->eh_target_reset_handler() is invoked for each target.

It's only invoked for targets that have commands on the work-q, not each. We
could have 16 rports open, but only one of them has a command that timed out,
and is in EH.

At least that is what I thought it does.

Oh, yes. It is invoked for every target for which a scmd is on the 
work_q. Will be updating the patch.

+	    If target reset succeeds, all failed scmds on all ready or
+	    offline sdevs on the target are EH-finished.
+
  	3. If !list_empty(&eh_work_q), invoke scsi_eh_bus_reset()
  
  	``scsi_eh_bus_reset``
diff --git a/Documentation/scsi/scsi_mid_low_api.rst b/Documentation/scsi/scsi_mid_low_api.rst
index 43083efc554b..8a075da5641b 100644
--- a/Documentation/scsi/scsi_mid_low_api.rst
+++ b/Documentation/scsi/scsi_mid_low_api.rst
@@ -758,6 +758,24 @@ Details::
  	int eh_bus_reset_handler(struct Scsi_Host * host, unsigned int channel)
  
  
+    /**
+    *      eh_target_reset_handler - issue SCSI target reset
+    *      @starget: identifies SCSI target to be reset
+    *
+    *      Returns SUCCESS if command aborted else FAILED

Same as in Patch 1. Or in a later patch, if you prefer.

Yes, I do :-)

+    *
+    *      Locks: None held
+    *
+    *      Calling context: kernel thread
+    *
+    *      Notes: Invoked from scsi_eh thread. No other commands will be
+    *      queued on current host during eh.
+    *
+    *      Optionally defined in: LLD
+    **/
+	int eh_target_reset_handler(struct scsi_target * starget)
+
+
      /**
      *      eh_device_reset_handler - issue SCSI device reset
      *      @scp: identifies SCSI device to be reset
diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 300f8e955a53..a9d274cabb37 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -2012,7 +2020,7 @@ static const struct scsi_host_template mptsas_driver_template = {
  	.change_queue_depth 		= mptscsih_change_queue_depth,
  	.eh_timed_out			= mptsas_eh_timed_out,
  	.eh_abort_handler		= mptscsih_abort,
-	.eh_device_reset_handler	= mptscsih_dev_reset,
+	.eh_target_reset_handler	= mptsas_eh_target_reset,

Huh, it that correct? Replace the device reset handler with the target
handler? That seems odd.

Please check drivers/message/fusion/mptscsih.c

mptscsih_dev_reset _is_ a target reset.

  	.eh_host_reset_handler		= mptscsih_host_reset,
  	.bios_param			= mptscsih_bios_param,
  	.can_queue			= MPT_SAS_CAN_QUEUE,
diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 6c5920db1e9d..d379dea7074c 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -834,7 +840,7 @@ static const struct scsi_host_template mptspi_driver_template = {
  	.slave_destroy			= mptspi_slave_destroy,
  	.change_queue_depth 		= mptscsih_change_queue_depth,
  	.eh_abort_handler		= mptscsih_abort,
-	.eh_device_reset_handler	= mptscsih_dev_reset,
+	.eh_target_reset_handler	= mptspi_eh_target_reset,

...

  	.eh_bus_reset_handler		= mptscsih_bus_reset,
  	.eh_host_reset_handler		= mptscsih_host_reset,
  	.bios_param			= mptscsih_bios_param,
diff --git a/drivers/s390/scsi/zfcp_scsi.c b/drivers/s390/scsi/zfcp_scsi.c
index 76c136d39bf1..5c9a7c9f9a98 100644
--- a/drivers/s390/scsi/zfcp_scsi.c
+++ b/drivers/s390/scsi/zfcp_scsi.c
@@ -340,9 +340,12 @@ static int zfcp_scsi_eh_device_reset_handler(struct scsi_cmnd *scpnt)
  	return zfcp_scsi_task_mgmt_function(sdev, FCP_TMF_LUN_RESET);
  }
  
-static int zfcp_scsi_eh_target_reset_handler(struct scsi_cmnd *scpnt)
+/*
+ * Note: We need to select a LUN as the storage array doesn't
+ * necessarily supports LUN 0 and might refuse the target reset.

That might be true, I don't really know, but it's not the reason why we added
this "search a sdev" logic to the function. The firmware for FCP channels
needs a valid "LUN handle" for each FCP request we send, otherwise it's
rejected. So we have to find a valid handle. This is an artifact from
HBA virtualisation before NPIV on Z.

There is more details in
674595d8519f ("scsi: zfcp: decouple our scsi_eh callbacks from scsi_cmnd")

 From my perspective you can just remove the comment, or change it to match
the part about `zfcp_scsi_eh_target_reset_handler()` in what Steffen wrote in
the commit.

Okay, will be doing so.

+ */
+static int zfcp_scsi_eh_target_reset_handler(struct scsi_target *starget)
  {
-	struct scsi_target *starget = scsi_target(scpnt->device);
  	struct fc_rport *rport = starget_to_rport(starget);
  	struct Scsi_Host *shost = rport_to_shost(rport);
  	struct scsi_device *sdev = NULL, *tmp_sdev;
diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index 98aa17a6448a..244b7a6f8616 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -6045,15 +6045,15 @@ lpfc_device_reset_handler(struct scsi_cmnd *cmnd)
   *  0x2002 - Success
   **/
  static int
-lpfc_target_reset_handler(struct scsi_cmnd *cmnd)
+lpfc_target_reset_handler(struct scsi_target *starget)
  {
-	struct Scsi_Host  *shost = cmnd->device->host;
-	struct fc_rport *rport = starget_to_rport(scsi_target(cmnd->device));
+	struct fc_rport *rport = starget_to_rport(starget);
+	struct Scsi_Host  *shost = rport_to_shost(rport);
  	struct lpfc_vport *vport = (struct lpfc_vport *) shost->hostdata;
  	struct lpfc_rport_data *rdata;
  	struct lpfc_nodelist *pnode;
-	unsigned tgt_id = cmnd->device->id;
-	uint64_t lun_id = cmnd->device->lun;
+	unsigned tgt_id = starget->id;
+	uint64_t lun_id = 0;

Well, hopefully storage targets for LPFC can deal with LUN 0 :)

Sad to say, but it's only with zfcp where I've seen the 'WLUN' thingie 
being deployed. But alright, I'll check if I can grab a valid LUN ID.

  	struct lpfc_scsi_event_header scsi_event;
  	int status;
  	u32 logit = LOG_FCP;
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index c60014e07b44..bee950f11576 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -4804,21 +4804,25 @@ int megasas_task_abort_fusion(struct scsi_cmnd *scmd)
  
  /*
   * megasas_reset_target_fusion : target reset function for fusion adapters
- * scmd: SCSI command pointer
+ * @shost: SCSI host pointer
+ * @starget: SCSI target pointer
   *
   * Returns SUCCESS if all commands associated with target aborted else FAILED
   */
  
-int megasas_reset_target_fusion(struct scsi_cmnd *scmd)
+int megasas_reset_target_fusion(struct Scsi_Host *shost,
+				struct scsi_target *starget)
  {
  
  	struct megasas_instance *instance;
+	struct scsi_device *sdev;
  	int ret = FAILED;
-	u16 devhandle;
-	struct MR_PRIV_DEVICE *mr_device_priv_data;
-	mr_device_priv_data = scmd->device->hostdata;
+	u16 devhandle = USHRT_MAX;
+	struct MR_PRIV_DEVICE *mr_device_priv_data = NULL;
  
-	instance = (struct megasas_instance *)scmd->device->host->hostdata;
+	instance = (struct megasas_instance *)shost->hostdata;

`shost_priv()`? And you can move that to the top as well I think.

Ok, will be doing that.

+	starget_printk(KERN_INFO, starget,
+		    "target reset called\n");
  
  	if (atomic_read(&instance->adprecovery) != MEGASAS_HBA_OPERATIONAL) {
  		dev_err(&instance->pdev->dev, "Controller is not OPERATIONAL,"
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index 381e07f2b718..63d95aa8a5f3 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -4090,54 +4090,41 @@ static int mpi3mr_eh_bus_reset(struct Scsi_Host *shost, unsigned int channel)
   * Return: SUCCESS of successful termination of the scmd else
   *         FAILED
   */
-static int mpi3mr_eh_target_reset(struct scsi_cmnd *scmd)
+static int mpi3mr_eh_target_reset(struct scsi_target *starget)
  {
-	struct mpi3mr_ioc *mrioc = shost_priv(scmd->device->host);
+	struct Scsi_Host *shost = dev_to_shost(&starget->dev);
+	struct mpi3mr_ioc *mrioc = shost_priv(shost);
  	struct mpi3mr_stgt_priv_data *stgt_priv_data;
-	struct mpi3mr_sdev_priv_data *sdev_priv_data;
  	u16 dev_handle;
  	u8 resp_code = 0;
  	int retval = FAILED, ret = 0;
  
-	sdev_printk(KERN_INFO, scmd->device,
-	    "Attempting Target Reset! scmd(%p)\n", scmd);
-	scsi_print_command(scmd);
-
-	sdev_priv_data = scmd->device->hostdata;
-	if (!sdev_priv_data || !sdev_priv_data->tgt_priv_data) {
-		sdev_printk(KERN_INFO, scmd->device,
-		    "SCSI device is not available\n");
-		retval = SUCCESS;
-		goto out;
-	}
+	starget_printk(KERN_INFO, starget,
+	    "Attempting Target Reset!\n");

Nitpick: you can remove the line-break.

Yes, and no. I had been debating with me whether I really wanted
to do that. Linebreaks are in nearly all of the debugging messages
in this driver, so once I start removing them here questions will
be asked why I removed it _just_ here, or, why I removed it at all.

I'd prefer doing that in a separate patch.

  
-	stgt_priv_data = sdev_priv_data->tgt_priv_data;
+	stgt_priv_data = starget->hostdata;
  	dev_handle = stgt_priv_data->dev_handle;
  	if (stgt_priv_data->dev_removed) {
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f82551783feb..fc0510cd367c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1662,7 +1664,8 @@ static int scsi_eh_target_reset(struct Scsi_Host *shost,
  					     " target: %d\n",
  					     current->comm, id));
  		list_for_each_entry_safe(scmd, next, &tmp_list, eh_entry) {
-			if (scmd_id(scmd) != id)
+			if (scmd_channel(scmd) != channel ||
+			    scmd_id(scmd) != id)

Hmm, interesting. So this was broken before as well?

Tbh., this seems like a fix that could go to stable? Should we have a separate
patch for it?

Technically, yes. In practise it'll be hard to trigger, so no-one has 
seen this issue so far.

But good point, I'll be splitting it out into a separate patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman