On 10/07/2010 12:49 AM, Vikas Chaudhary wrote:
From: Nilesh Javali<nilesh.javali@xxxxxxxxxx> Do not wait for the outstanding commands to complete in case of firmware hang. Signed-off-by: Nilesh Javali<nilesh.javali@xxxxxxxxxx> Signed-off-by: Vikas Chaudhary<vikas.chaudhary@xxxxxxxxxx> Signed-off-by: Ravi Anand<ravi.anand@xxxxxxxxxx> --- drivers/scsi/qla4xxx/ql4_os.c | 6 ++++-- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c index 56962e5..a6455fb 100644 --- a/drivers/scsi/qla4xxx/ql4_os.c +++ b/drivers/scsi/qla4xxx/ql4_os.c @@ -1102,7 +1102,8 @@ static int qla4xxx_recover_adapter(struct scsi_qla_host *ha) ha->host_no, __func__)); status = ha->isp_ops->reset_firmware(ha); if (status == QLA_SUCCESS) { - qla4xxx_cmd_wait(ha); + if (!test_bit(AF_FW_RECOVERY,&ha->flags)) + qla4xxx_cmd_wait(ha); ha->isp_ops->disable_intrs(ha); qla4xxx_process_aen(ha, FLUSH_DDB_CHANGED_AENS); qla4xxx_abort_active_cmds(ha, DID_RESET<< 16); @@ -1119,7 +1120,8 @@ static int qla4xxx_recover_adapter(struct scsi_qla_host *ha) * or if stop_firmware fails for ISP-82xx. * This is the default case for ISP-4xxx */ if (!is_qla8022(ha) || reset_chip) { - qla4xxx_cmd_wait(ha); + if (!test_bit(AF_FW_RECOVERY,&ha->flags)) + qla4xxx_cmd_wait(ha); qla4xxx_process_aen(ha, FLUSH_DDB_CHANGED_AENS); qla4xxx_abort_active_cmds(ha, DID_RESET<< 16); DEBUG2(ql4_printk(KERN_INFO, ha,
If we go from qla4xxx_eh_host_reset->qla4xxx_recover_adapter and you do not wait, is is possible for the driver to be accessing scsi commands after qla4xxx_eh_host_reset returns? If so I think there is a problem where the scsi eh will fail or retry the cmd so the memory for the scsi command could be reallocated/freed.
On a related note, has qla4xxx_eh_host_reset ever returned SUCCESS? I think there is a problem in qla4xxx_cmd_wait when the scsi eh is running. At that time blk_finish_request->blk_queue_end_tag is not going to be run, because if the driver were to call scsi_done the block layer would do blk_complete_request and the blk_mark_rq_complete would fail (this is because the scsi/block eh timeout could has already marked it complete).
I think you need the attached patch.
qla4xxx: Fix cmd check in qla4xxx_cmd_wait If the command has timedout then the block layer has called blk_mark_rq_complete. If qla4xxx_cmd_wait is then called from qla4xxx_eh_host_reset, we will always fail, because if the driver calls scsi_done then the the block layer will fail at blk_complete_request's blk_mark_rq_complete call instead of calling the normal completion path including the function, blk_queue_end_tag, which releases the tag. Patch is not tested. Signed-off-by: Mike Christie <michaelc@xxxxxxxxxxx> diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c index 370d40f..97f6d68 100644 --- a/drivers/scsi/qla4xxx/ql4_os.c +++ b/drivers/scsi/qla4xxx/ql4_os.c @@ -885,7 +885,13 @@ static int qla4xxx_cmd_wait(struct scsi_qla_host *ha) /* Find a command that hasn't completed. */ for (index = 0; index < ha->host->can_queue; index++) { cmd = scsi_host_find_tag(ha->host, index); - if (cmd != NULL) + /* + * We cannot just check if the index is valid, + * becase if we are run from the scsi eh, then + * the scsi/block layer is going to prevent + * the tag from being released. + */ + if (cmd != NULL && CMD_SP(cmd)) break; } spin_unlock_irqrestore(&ha->hardware_lock, flags);