Hi Don, some comments inline. On 04/16/2015 03:46 PM, Don Brace wrote: > From: Webb Scales <webb.scales@xxxxxx> > > Allow driver initiated commands to have a timeout. It does not > yet try to do anything with timeouts on such commands. > > We are sending a reset in order to get rid of a command we want to abort. > If we make it return on the same reply queue as the command we want to abort, > the completion of the aborted command will not race with the completion of > the reset command. > > Rename hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd(), since > this function is the interface for issuing commands to the controller and > not the "core" of that implementation. Add a parameter to it which allows > the caller to specify the reply queue to be used. Modify existing callers > to specify the default reply queue. > > Rename __hpsa_scsi_do_simple_cmd_core() to hpsa_scsi_do_simple_cmd_core(), > since this routine is the "core" implementation of the "do simple command" > function and there is no longer any other function with a similar name. > Modify the existing callers of this routine (other than > hpsa_scsi_do_simple_cmd()) to instead call hpsa_scsi_do_simple_cmd(), since > it will now accept the reply_queue paramenter, and it provides a controller > lock-up check. (Also, tweak two related message strings to make them > distinct from each other.) > > Submitting a command to a locked up controller always results in a timeout, > so check for controller lock-up before submitting. > > This is to enable fixing a race between command completions and > abort completions on different reply queues in a subsequent patch. > We want to be able to specify which reply queue an abort completion > should occur on so that it cannot race the completion of the command > it is trying to abort. > > The following race was possible in theory: > > 1. Abort command is sent to hardware. > 2. Command to be aborted simultaneously completes on another > reply queue. > 3. Hardware receives abort command, decides command has already > completed and indicates this to the driver via another different > reply queue. > 4. driver processes abort completion finds that the hardware does not know > about the command, concludes that therefore the command cannot complete, > returns SUCCESS indicating to the mid-layer that the scsi_cmnd may be > re-used. > 5. Command from step 2 is processed and completed back to scsi mid > layer (after we already promised that would never happen.) > > Fix by forcing aborts to complete on the same reply queue as the command > they are aborting. > > Piggybacking device rescanning functionality onto the lockup > detection thread is not a good idea because if the controller > locks up during device rescanning, then the thread could get > stuck, then the lockup isn't detected. Use separate work > queues for device rescanning and lockup detection. > > Detect controller lockup in abort handler. > > After a lockup is detected, return DO_NO_CONNECT which results in immediate > termination of commands rather than DID_ERR which results in retries. > > Modify detect_controller_lockup() to return the result, to remove the need for a separate check. > > Reviewed-by: Scott Teel <scott.teel@xxxxxxxx> > Reviewed-by: Kevin Barnett <kevin.barnett@xxxxxxxx> > Signed-off-by: Webb Scales <webbnh@xxxxxx> > Signed-off-by: Don Brace <don.brace@xxxxxxxx> > --- > drivers/scsi/hpsa.c | 329 ++++++++++++++++++++++++++++++++++++----------- > drivers/scsi/hpsa_cmd.h | 5 + > 2 files changed, 257 insertions(+), 77 deletions(-) > [ .. ] > @@ -4375,13 +4469,46 @@ static int hpsa_eh_device_reset_handler(struct scsi_cmnd *scsicmd) > "device lookup failed.\n"); > return FAILED; > } > - dev_warn(&h->pdev->dev, "resetting device %d:%d:%d:%d\n", > - h->scsi_host->host_no, dev->bus, dev->target, dev->lun); > + > + /* if controller locked up, we can guarantee command won't complete */ > + if (lockup_detected(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d RESET FAILED, lockup detected\n", > + h->scsi_host->host_no, dev->bus, dev->target, > + dev->lun); > + return FAILED; > + } > + > + /* this reset request might be the result of a lockup; check */ > + if (detect_controller_lockup(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d RESET FAILED, new lockup detected\n", > + h->scsi_host->host_no, dev->bus, dev->target, > + dev->lun); > + return FAILED; > + } > + > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d: %s %.8s %.16s resetting RAID-%s SSDSmartPathCap%c En%c Exp=%d\n", > + h->scsi_host->host_no, dev->bus, dev->target, dev->lun, > + scsi_device_type(dev->devtype), > + dev->vendor, > + dev->model, > + dev->raid_level > RAID_UNKNOWN ? > + "RAID-?" : raid_label[dev->raid_level], > + dev->offload_config ? '+' : '-', > + dev->offload_enabled ? '+' : '-', > + dev->expose_state); > + Didn't you just reworked the message logging in the previous patch? Why can't you use it here? > /* send a reset to the SCSI LUN which the command was sent to */ > - rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN); > + rc = hpsa_send_reset(h, dev->scsi3addr, HPSA_RESET_TYPE_LUN, > + DEFAULT_REPLY_QUEUE); > if (rc == 0 && wait_for_device_to_become_ready(h, dev->scsi3addr) == 0) > return SUCCESS; > > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%d reset failed\n", > + h->scsi_host->host_no, dev->bus, dev->target, dev->lun); > return FAILED; > } > > @@ -4426,7 +4553,7 @@ static void hpsa_get_tag(struct ctlr_info *h, > } > > static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr, > - struct CommandList *abort, int swizzle) > + struct CommandList *abort, int swizzle, int reply_queue) > { > int rc = IO_OK; > struct CommandList *c; > @@ -4444,9 +4571,9 @@ static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr, > 0, 0, scsi3addr, TYPE_MSG); > if (swizzle) > swizzle_abort_tag(&c->Request.CDB[4]); > - hpsa_scsi_do_simple_cmd_core(h, c); > + (void) hpsa_scsi_do_simple_cmd(h, c, reply_queue, NO_TIMEOUT); > hpsa_get_tag(h, abort, &taglower, &tagupper); > - dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd_core completed.\n", > + dev_dbg(&h->pdev->dev, "%s: Tag:0x%08x:%08x: do_simple_cmd(abort) completed.\n", > __func__, tagupper, taglower); > /* no unmap needed here because no data xfer. */ > > @@ -4478,7 +4605,7 @@ static int hpsa_send_abort(struct ctlr_info *h, unsigned char *scsi3addr, > */ > > static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h, > - unsigned char *scsi3addr, struct CommandList *abort) > + unsigned char *scsi3addr, struct CommandList *abort, int reply_queue) > { > int rc = IO_OK; > struct scsi_cmnd *scmd; /* scsi command within request being aborted */ > @@ -4521,7 +4648,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h, > "Reset as abort: Resetting physical device at scsi3addr 0x%02x%02x%02x%02x%02x%02x%02x%02x\n", > psa[0], psa[1], psa[2], psa[3], > psa[4], psa[5], psa[6], psa[7]); > - rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET); > + rc = hpsa_send_reset(h, psa, HPSA_RESET_TYPE_TARGET, reply_queue); > if (rc != 0) { > dev_warn(&h->pdev->dev, > "Reset as abort: Failed on physical device at scsi3addr 0x%02x%02x%02x%02x%02x%02x%02x%02x\n", > @@ -4555,7 +4682,7 @@ static int hpsa_send_reset_as_abort_ioaccel2(struct ctlr_info *h, > * make this true someday become false. > */ > static int hpsa_send_abort_both_ways(struct ctlr_info *h, > - unsigned char *scsi3addr, struct CommandList *abort) > + unsigned char *scsi3addr, struct CommandList *abort, int reply_queue) > { > /* ioccelerator mode 2 commands should be aborted via the > * accelerated path, since RAID path is unaware of these commands, > @@ -4563,10 +4690,20 @@ static int hpsa_send_abort_both_ways(struct ctlr_info *h, > * Change abort to physical device reset. > */ > if (abort->cmd_type == CMD_IOACCEL2) > - return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, abort); > + return hpsa_send_reset_as_abort_ioaccel2(h, scsi3addr, > + abort, reply_queue); > + > + return hpsa_send_abort(h, scsi3addr, abort, 0, reply_queue) && > + hpsa_send_abort(h, scsi3addr, abort, 1, reply_queue); > +} > > - return hpsa_send_abort(h, scsi3addr, abort, 0) && > - hpsa_send_abort(h, scsi3addr, abort, 1); > +/* Find out which reply queue a command was meant to return on */ > +static int hpsa_extract_reply_queue(struct ctlr_info *h, > + struct CommandList *c) > +{ > + if (c->cmd_type == CMD_IOACCEL2) > + return h->ioaccel2_cmd_pool[c->cmdindex].reply_queue; > + return c->Header.ReplyQueue; > } > > /* Send an abort for the specified command. > @@ -4584,7 +4721,7 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc) > char msg[256]; /* For debug messaging. */ > int ml = 0; > __le32 tagupper, taglower; > - int refcount; > + int refcount, reply_queue; > > /* Find the controller of the command to be aborted */ > h = sdev_to_hba(sc->device); > @@ -4592,8 +4729,23 @@ static int hpsa_eh_abort_handler(struct scsi_cmnd *sc) > "ABORT REQUEST FAILED, Controller lookup failed.\n")) > return FAILED; > > - if (lockup_detected(h)) > + /* If controller locked up, we can guarantee command won't complete */ > + if (lockup_detected(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, lockup detected\n", > + h->scsi_host->host_no, sc->device->channel, > + sc->device->id, sc->device->lun, sc); > return FAILED; > + } > + > + /* This is a good time to check if controller lockup has occurred */ > + if (detect_controller_lockup(h)) { > + dev_warn(&h->pdev->dev, > + "scsi %d:%d:%d:%llu scmd %p ABORT FAILED, new lockup detected\n", > + h->scsi_host->host_no, sc->device->channel, > + sc->device->id, sc->device->lun, sc); > + return FAILED; > + } > Same here. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@xxxxxxx +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html