There's a curious case where devices in clusters are taken offline as
soon as they enter error handling. The reason is that, in this
particular cluster, TEST UNIT READY returns RESERVATION CONFLICT when
another node owns the storage. This means every TUR that error handling
issues is marked as failed, so we always assume the device is
unrecoverable and take it offline.

Fix this by checking, in the error handling code that processes command
returns, whether the command was a TUR and, if so, translating the EH
return to SUCCESS (after all, if the target managed to return
RESERVATION CONFLICT, we've successfully made contact with it). Any
other command that gets RESERVATION CONFLICT is now reported as FAILED.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2bf9846..5e2d36f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -473,10 +473,12 @@ static int scsi_eh_completed_normally(struct scsi_cmnd *scmd)
 		 */
 		return SUCCESS;
 	case RESERVATION_CONFLICT:
-		/*
-		 * let issuer deal with this, it could be just fine
-		 */
-		return SUCCESS;
+		if (scmd->cmnd[0] == TEST_UNIT_READY)
+			/* it is a success, we probed the device and
+			 * found it */
+			return SUCCESS;
+		/* otherwise, we failed to send the command */
+		return FAILED;
 	case QUEUE_FULL:
 		scsi_handle_queue_full(scmd->device);
 		/* fall through */
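
For anyone who wants to reason about the new RESERVATION CONFLICT
handling outside the kernel tree, here is a minimal user-space sketch of
the decision the patch makes. It is not the kernel code: the enum
eh_disposition, the macro OPCODE_TEST_UNIT_READY and the helper
eh_disposition_for_conflict() are names made up for illustration, and
the real SUCCESS/FAILED values come from the kernel headers. The only
thing taken from the SCSI spec is that TEST UNIT READY has opcode 0x00.

```c
/* Hypothetical stand-ins for the kernel's EH dispositions
 * (the real code returns SUCCESS or FAILED). */
enum eh_disposition { EH_SUCCESS, EH_FAILED };

/* TEST UNIT READY opcode; the kernel calls this TEST_UNIT_READY. */
#define OPCODE_TEST_UNIT_READY 0x00

/*
 * Decide what error handling should conclude when a command it issued
 * comes back with RESERVATION CONFLICT status. cdb points at the
 * command descriptor block; byte 0 is the opcode.
 */
static enum eh_disposition
eh_disposition_for_conflict(const unsigned char *cdb)
{
	/* A TUR is only a probe: any answer from the target, even a
	 * reservation conflict, shows the device is reachable. */
	if (cdb[0] == OPCODE_TEST_UNIT_READY)
		return EH_SUCCESS;

	/* Any other command did not actually get executed. */
	return EH_FAILED;
}
```

With this model, a CDB starting with 0x00 maps to EH_SUCCESS, while
something like START STOP UNIT (opcode 0x1b) maps to EH_FAILED.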