Re: [PATCH v4 03/43] hpsa: rework controller command submission

Hannes Reinecke <hare@xxxxxxx> · Fri, 17 Apr 2015 14:57:57 +0200

Hi Don,

some comments inline.

On 04/16/2015 03:46 PM, Don Brace wrote:
> From: Webb Scales <webb.scales@xxxxxx>
> 
> Allow driver initiated commands to have a timeout.  It does not
> yet try to do anything with timeouts on such commands.
> 
> We are sending a reset in order to get rid of a command we want to abort.
> If we make it return on the same reply queue as the command we want to abort,
> the completion of the aborted command will not race with the completion of
> the reset command.
> 
> Rename hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd(), since
> this function is the interface for issuing commands to the controller and
> not the "core" of that implementation.  Add a parameter to it which allows
> the caller to specify the reply queue to be used.  Modify existing callers
> to specify the default reply queue.
> 
> Rename __hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd_core(),
> since this routine is the "core" implementation of the "do simple command"
> function and there is no longer any other function with a similar name.
> Modify the existing callers of this routine (other than
> hpsa_scsi_do_simple_cmd()) to instead call hpsa_scsi_do_simple_cmd(), since
> it will now accept the reply_queue paramenter, and it provides a controller
> lock-up check.  (Also, tweak two related message strings to make them
> distinct from each other.)
> 
> Submitting a command to a locked up controller always results in a timeout,
> so check for controller lock-up before submitting.
> 
> This is to enable fixing a race between command completions and
> abort completions on different reply queues in a subsequent patch.
> We want to be able to specify which reply queue an abort completion
> should occur on so that it cannot race the completion of the command
> it is trying to abort.
> 
> The following race was possible in theory:
> 
>   1. Abort command is sent to hardware.
>   2. Command to be aborted simultaneously completes on another
>      reply queue.
>   3. Hardware receives abort command, decides command has already
>      completed and indicates this to the driver via another different
>      reply queue.
>   4. driver processes abort completion finds that the hardware does not know
>      about the command, concludes that therefore the command cannot complete,
>      returns SUCCESS indicating to the mid-layer that the scsi_cmnd may be
>      re-used.
>   5. Command from step 2 is processed and completed back to scsi mid
>      layer (after we already promised that would never happen.)
> 
> Fix by forcing aborts to complete on the same reply queue as the command
> they are aborting.
> 
> Piggybacking device rescanning functionality onto the lockup
> detection thread is not a good idea because if the controller
> locks up during device rescanning, then the thread could get
> stuck, then the lockup isn't detected.  Use separate work
> queues for device rescanning and lockup detection.
> 
> Detect controller lockup in abort handler.
> 
> After a lockup is detected, return DO_NO_CONNECT which results in immediate
> termination of commands rather than DID_ERR which results in retries.
> 
> Modify detect_controller_lockup() to return the result, to remove the need for a separate check.
> 
> Reviewed-by: Scott Teel <scott.teel@xxxxxxxx>
> Reviewed-by: Kevin Barnett <kevin.barnett@xxxxxxxx>
> Signed-off-by: Webb Scales <webbnh@xxxxxx>
> Signed-off-by: Don Brace <don.brace@xxxxxxxx>
> ---
>  drivers/scsi/hpsa.c     |  329 ++++++++++++++++++++++++++++++++++++-----------
>  drivers/scsi/hpsa_cmd.h |    5 +
>  2 files changed, 257 insertions(+), 77 deletions(-)
> 
[ .. ]
> @@ -4375,13 +4469,46 @@ static int hpsa_eh_device_reset_handler(struct scsi_cmnd *scsicmd)
>  			"device lookup failed.\n");
>  		return FAILED;
>  	}
> -	dev_warn(&h->pdev->dev, "resetting device %d:%d:%d:%d\n",
> -		h->scsi_host->host_no, dev->bus, dev->target, dev->lun);
> +
> +	/* if controller locked up, we can guarantee command won't complete */
> +	if (lockup_detected(h)) {
> +		dev_warn(&h->pdev->dev,
> +			"scsi %d:%d:%d:%d RESET FAILED, lockup detected\n",
> +			h->scsi_host->host_no, dev->bus, dev->target,
> +			dev->lun);
> +		return FAILED;
> +	}
> +
> +	/* this reset request might be the result of a lockup; check */
> +	if (detect_controller_lockup(h)) {
> +		dev_warn(&h->pdev->dev,
> +			 "scsi %d:%d:%d:%d RESET FAILED, new lockup detected\n",
> +			 h->scsi_host->host_no, dev->bus, dev->target,
> +			 dev->lun);
> +		return FAILED;
> +	}
> +
> +	dev_warn(&h->pdev->dev,
> +		"scsi %d:%d:%d:%d: %s %.8s %.16s resetting RAID-%s SSDSmartPathCap%c En%c Exp=%d\n",
> +		h->scsi_host->host_no, dev->bus, dev->target, dev->lun,
> +		scsi_device_type(dev->devtype),
> +		dev->vendor,
> +		dev->model,
> +		dev->raid_level > RAID_UNKNOWN ?
> +			"RAID-?" : raid_label[dev->raid_level],
> +		dev->offload_config ? '+' : '-',
> +		dev->offload_enabled ? '+' : '-',
> +		dev->expose_state);
> +
Didn't you just reworked the message logging in the previous patch?
Why can't you use it here?

>  	/* send a reset to the SCSI LUN which the command was sent to */
> -	rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN);
> +	rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN,
> +			     DEFAULT_REPLY_QUEUE);
>  	if (rc == 0 && wait_for_device_to_become_ready(h, dev->scsi3addr) == 0)
>  		return SUCCESS;
>  
> +	dev_warn(&h->pdev->dev,
> +		"scsi %d:%d:%d:%d reset failed\n",
> +		h->scsi_host->host_no, dev->bus, dev->target, dev->lun);
>  	return FAILED;
>  }
>  
> @@ -4426,7 +4553,7 @@ static void hpsa_get_tag(struct ctlr_info *h,
>  }
>  
>  static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr,
> -	struct CommandList *abort, int swizzle)
> +	struct CommandList *abort, int swizzle, int reply_queue)
>  {
>  	int rc = IO_OK;
>  	struct CommandList *c;
> @@ -4444,9 +4571,9 @@ static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr,
>  		0, 0, scsi3addr, TYPE_MSG);
>  	if (swizzle)
>  		swizzle_abort_tag(&c->Request.CDB[4]);
> -	hpsa_scsi_do_simple_cmd_core(h, c);
> +	(void) hpsa_scsi_do_simple_cmd(h, c, reply_queue, NO_TIMEOUT);
>  	hpsa_get_tag(h, abort, &taglower, &tagupper);
> -	dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd_core completed.\n",
> +	dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd(abort) completed.\n",
>  		__func__, tagupper, taglower);
>  	/* no unmap needed here because no data xfer. */
>  
> @@ -4478,7 +4605,7 @@ static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr,
>   */
>  
>  static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h,
> -	unsigned char *scsi3addr, struct CommandList *abort)
> +	unsigned char *scsi3addr, struct CommandList *abort, int reply_queue)
>  {
>  	int rc = IO_OK;
>  	struct scsi_cmnd *scmd; /* scsi command within request being aborted */
> @@ -4521,7 +4648,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h,
>  			"Reset as abort: Resetting physical device at scsi3addr 0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
>  			psa[0], psa[1], psa[2], psa[3],
>  			psa[4], psa[5], psa[6], psa[7]);
> -	rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET);
> +	rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET, reply_queue);
>  	if (rc != 0) {
>  		dev_warn(&h->pdev->dev,
>  			"Reset as abort: Failed on physical device at scsi3addr 0x%02x%02x%02x%02x%02x%02x%02x%02x\n",
> @@ -4555,7 +4682,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h,
>   * make this true someday become false.
>   */
>  static int hpsa_send_abort_both_ways(struct ctlr_info *h,
> -	unsigned char *scsi3addr, struct CommandList *abort)
> +	unsigned char *scsi3addr, struct CommandList *abort, int reply_queue)
>  {
>  	/* ioccelerator mode 2 commands should be aborted via the
>  	 * accelerated path, since RAID path is unaware of these commands,
> @@ -4563,10 +4690,20 @@ static int hpsa_send_abort_both_ways(struct ctlr_info *h,
>  	 * Change abort to physical device reset.
>  	 */
>  	if (abort->cmd_type == CMD_IOACCEL2)
> -		return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, abort);
> +		return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr,
> +							abort, reply_queue);
> +
> +	return hpsa_send_abort(h, scsi3addr, abort, 0, reply_queue) &&
> +			hpsa_send_abort(h, scsi3addr, abort, 1, reply_queue);
> +}
>  
> -	return hpsa_send_abort(h, scsi3addr, abort, 0) &&
> -			hpsa_send_abort(h, scsi3addr, abort, 1);
> +/* Find out which reply queue a command was meant to return on */
> +static int hpsa_extract_reply_queue(struct ctlr_info *h,
> +					struct CommandList *c)
> +{
> +	if (c->cmd_type == CMD_IOACCEL2)
> +		return h->ioaccel2_cmd_pool[c->cmdindex].reply_queue;
> +	return c->Header.ReplyQueue;
>  }
>  
>  /* Send an abort for the specified command.
> @@ -4584,7 +4721,7 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc)
>  	char msg[256];		/* For debug messaging. */
>  	int ml = 0;
>  	__le32 tagupper, taglower;
> -	int refcount;
> +	int refcount, reply_queue;
>  
>  	/* Find the controller of the command to be aborted */
>  	h = sdev_to_hba(sc->device);
> @@ -4592,8 +4729,23 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc)
>  			"ABORT REQUEST FAILED, Controller lookup failed.\n"))
>  		return FAILED;
>  
> -	if (lockup_detected(h))
> +	/* If controller locked up, we can guarantee command won't complete */
> +	if (lockup_detected(h)) {
> +		dev_warn(&h->pdev->dev,
> +			"scsi %d:%d:%d:%llu scmd %p ABORT FAILED, lockup detected\n",
> +			h->scsi_host->host_no, sc->device->channel,
> +			sc->device->id, sc->device->lun, sc);
>  		return FAILED;
> +	}
> +
> +	/* This is a good time to check if controller lockup has occurred */
> +	if (detect_controller_lockup(h)) {
> +		dev_warn(&h->pdev->dev,
> +			 "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, new lockup detected\n",
> +			 h->scsi_host->host_no, sc->device->channel,
> +			 sc->device->id, sc->device->lun, sc);
> +		return FAILED;
> +	}
>  
Same here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@xxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html