On 10/23/2013 09:46 AM, James Bottomley wrote:
On Wed, 2013-10-23 at 11:25 +0200, Hannes Reinecke wrote:
On 10/16/2013 09:22 PM, James Bottomley wrote:
On Mon, 2013-07-01 at 08:50 +0200, Hannes Reinecke wrote:
This patchs adds an 'eh_deadline' sysfs attribute to the scsi
host which limits the overall runtime of the SCSI EH.
The 'eh_deadline' value is stored in the now obsolete field
'resetting'.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.
OK, so the specific problem with this one is that potentially it will
spend all its time mucking about with aborts (which most often time out
on non FC kit because of the issue problems) and then proceed to host
reset, which mostly does nothing for failing devices.
If you want to impose a deadline, then we need to spend only 50% of the
time attempting aborts and the rest of the time escalating the resets.
[...]
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f43de1e..84369f2 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -89,6 +89,18 @@ void scsi_schedule_eh(struct Scsi_Host *shost)
}
EXPORT_SYMBOL_GPL(scsi_schedule_eh);
+static int scsi_host_eh_past_deadline(struct Scsi_Host *shost)
+{
+ if (!shost->last_reset || !shost->eh_deadline)
+ return 0;
+
+ if (time_before(jiffies,
+ shost->last_reset + shost->eh_deadline))
+ return 0;
+
+ return 1;
+}
+
What about instead:
static int scsi_host_eh_past_deadline(struct Scsi_Host *shost, int percent) {
if (!shost->last_reset || !shost->eh_deadline)
return 0;
if (time_before(jiffies,
shost->last_reset + shost->eh_deadline * percent/100))
return 0;
return 1;
}
which allows us to have
if (scsi_host_eh_past_deadline(shost, 50)) {
in scsi_eh_abort_cmds()
if (scsi_host_eh_past_deadline(shost, 66) {
in scsi_eh_bus_device_reset()
say 83 in target reset, and 100 in bus reset.
Thus ensuring we at least get a crack at the reset chain?
Alternatively we could just escalate directly to LUN reset once the
first abort fails. That looks like a cleaner solution here.
I'm OK with that ... if you want this in the current merge window, you
have about a week to code it up ...
Patch is already compiling :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@xxxxxxx +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html