Hello,
If a task management function is issued (e.g. with the sg_reset utility,
which is the easiest way) during active I/O to a qla2xxx device (ISP2422),
it often fails with messages like:
------------------------------------------------------------------
qla2xxx 0000:04:02.0: scsi(13:0:1): DEVICE RESET ISSUED.
qla2xxx 0000:04:02.0: qla2xxx_eh_device_reset: failed while waiting for commands
------------------------------------------------------------------
This can break the SCSI mid-level's error recovery and cause device(s)
to be erroneously taken offline when they are actually healthy.
I investigated and found that the driver waits for a limited time for
the firmware to finish aborting the outstanding commands with CS_ABORTED
status. If at least one command has not finished by the timeout, FAILED
is returned.
The problem is how the wait is implemented. Here is the code:
------------------------------------------------------------------
static int
qla2x00_eh_wait_on_command(scsi_qla_host_t *ha, struct scsi_cmnd *cmd)
{
#define ABORT_POLLING_PERIOD	1000
#define ABORT_WAIT_ITER		((10 * 1000) / (ABORT_POLLING_PERIOD))
	unsigned long wait_iter = ABORT_WAIT_ITER;
	int ret = QLA_SUCCESS;

	while (CMD_SP(cmd)) {
		msleep(ABORT_POLLING_PERIOD);

		if (--wait_iter)
			break;
	}
	if (CMD_SP(cmd))
		ret = QLA_FUNCTION_FAILED;

	return ret;
}
------------------------------------------------------------------
where CMD_SP() is defined as:
#define CMD_SP(Cmnd) ((Cmnd)->SCp.ptr)
SCp.ptr is set to NULL just before cmd->scsi_done() is called.
This way of waiting races with the SCSI mid-level: while
qla2x00_eh_wait_on_command() is sleeping in msleep(), the mid-level can
free and reuse the command, so SCp.ptr can become non-NULL again. The
function then sees a non-NULL pointer that belongs to a new command and
reports a timeout for a command that has actually completed, which could
lead to the above false errors.
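To make the race easier to see, here is a minimal user-space sketch. It is
not driver code: struct fake_cmnd, its generation counter, the sleep hook
and both wait functions are hypothetical names made up for illustration,
and the hook simply stands in for "whatever the mid-level does while we
sleep". The second wait function shows one possible direction (noticing
that the slot was reused), assuming such a counter existed; a real fix
would also need proper locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scsi_cmnd; only what the sketch needs. */
struct fake_cmnd {
	void *sp;			/* plays the role of SCp.ptr / CMD_SP() */
	unsigned long generation;	/* assumed: bumped each time the slot is reused */
};

/* Stands in for msleep(): models what happens while the waiter sleeps. */
typedef void (*sleep_hook_t)(struct fake_cmnd *cmd);

static int dummy_sp;	/* any non-NULL address will do */

/* The command completes and the slot is immediately reused for new I/O. */
static void complete_and_reuse(struct fake_cmnd *cmd)
{
	cmd->sp = NULL;		/* cleared just before scsi_done() */
	cmd->generation++;	/* slot handed out again */
	cmd->sp = &dummy_sp;	/* the new command is now outstanding */
}

/* Polls only the pointer. Returns 0 if finished, -1 on timeout. */
static int wait_pointer_only(struct fake_cmnd *cmd, sleep_hook_t nap, int iters)
{
	assert(cmd != NULL && nap != NULL);

	while (cmd->sp != NULL && iters-- > 0)
		nap(cmd);

	return cmd->sp != NULL ? -1 : 0;
}

/* Also watches the generation, so a reused slot counts as "finished". */
static int wait_with_generation(struct fake_cmnd *cmd, sleep_hook_t nap, int iters)
{
	unsigned long gen;

	assert(cmd != NULL && nap != NULL);
	gen = cmd->generation;

	while (cmd->sp != NULL && cmd->generation == gen && iters-- > 0)
		nap(cmd);

	if (cmd->generation != gen)
		return 0;	/* the command we waited for is gone */

	return cmd->sp != NULL ? -1 : 0;
}
```

With complete_and_reuse as the hook, wait_pointer_only() never observes the
NULL pointer and times out, while wait_with_generation() returns success
after the first sleep.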
Regards,
Vlad