Re: [PATCH] scsi: Allow error handling timeout to be specified

On 05/10/2013 07:51 PM, Baruch Even wrote:
On Fri, May 10, 2013 at 5:01 PM, Ewan Milne <emilne@xxxxxxxxxx> wrote:
On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
On Fri, May 10, 2013 at 3:43 PM, Ewan Milne <emilne@xxxxxxxxxx> wrote:

On Thu, 2013-05-09 at 23:11 -0400, Martin K. Petersen wrote:
Introduce eh_timeout which can be used for error handling purposes. This
was previously hardcoded to 10 seconds in the SCSI error handling
code. However, for some fast-fail scenarios it is necessary to be able
to tune this as it can take several iterations (bus device, target, bus,
controller) before we give up.

Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>


Thanks for posting this.  It will be very helpful to have this
capability, particularly when alternate paths to the device exist.

Acked-by: Ewan D. Milne <emilne@xxxxxxxxxx>


I would argue that waiting for the EH to time out before you switch to
another path is most likely wrong. If you did the first pass of
error recovery (task abort) and that failed, the
path/HBA/logical device is doomed. If you switch to another path,
it will either work (meaning the path/HBA was bad) or not (the logical
device was the culprit).

It is necessary to either know the disposition of a command or
else wait for a defined amount of time before retrying the command on
another path.  Otherwise you run the risk that the command will
eventually complete on the first path.  So yes, we need to do the abort
(and its timeout).


Actually, reducing the timeouts is probably not a good approach, since
it will cause the host to take a more radical approach without waiting
long enough for a potential recovery. In addition, the more radical
error handling steps such as host reset will destroy other paths for
completely unrelated devices/links. In my experience a host reset is
usually not required, and the Linux kernel currently reaches for this
big hammer too fast.

I believe that Hannes is working on a better error handling algorithm
that e.g. does not cause an emulated bus reset in an FC environment
by resetting all the targets (and affecting I/O to unrelated targets in
the process).

The error handling I have in mind (admittedly, not fully thought out)
should work for both FC and SAS. Currently the error recovery
progresses at the host level regardless of whether the errors are on one
device or all of them; it also stops the I/O on all devices and LUNs.
It would be nice if that was taken into account. My ideas may be more
suitable to the environment I work in (enterprise storage devices
rather than hosts), but I believe the same approach would benefit the
hosts as well.

It would be interesting to see what approach the new error handling will take.

So, my general idea is this:

1) Send command aborts from scsi_times_out(). There is no need to
   stop I/O on the host simply because a single command times
   out. And as scsi_times_out() is run from a separate thread anyway,
   we should be able to send ABORT TASK TMFs without a problem.
2) Modify recovery sequence.
   One of the major pitfalls of the current scsi_eh is that it
   spills over onto unrelated LUNs for higher levels. So for the
   new EH we should be using a sequence of
   - ABORT TASK
   - ABORT TASK SET
   - (Terminate I_T nexus)
   - (Host reset)
   'Terminate I_T nexus' for FibreChannel is equivalent to a LOGO.
   'Host reset' is the current host reset function.
3) Fine-grained recovery setting.
   There is no need to stop the entire host when doing a recovery;
   it should be sufficient to stop I/O to the unit
   (LUN, I_T nexus, host) when the error recovery is at the
   respective level.
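
To make points 2) and 3) a bit more concrete, here is a rough
userspace sketch of the escalation order and of the scope that would
have to be quiesced at each level. This is not kernel code and not a
proposed implementation: eh_level, eh_scope, eh_scope_for() and
eh_escalate() are made-up names for illustration, and the mapping of
ABORT TASK SET to the LUN is my assumption about what "respective
level" would mean there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recovery levels, ordered from least to most disruptive. */
enum eh_level {
    EH_ABORT_TASK,
    EH_ABORT_TASK_SET,
    EH_TERMINATE_NEXUS,   /* a LOGO on FibreChannel */
    EH_HOST_RESET,
    EH_NUM_LEVELS         /* also returned when every level failed */
};

/* Hypothetical scope of I/O that is stopped while a level runs. */
enum eh_scope {
    EH_SCOPE_COMMAND,     /* only the timed-out command, no I/O stopped */
    EH_SCOPE_LUN,
    EH_SCOPE_IT_NEXUS,
    EH_SCOPE_HOST
};

/* Assumed mapping: stop only the unit the current level acts on. */
static enum eh_scope eh_scope_for(enum eh_level level)
{
    switch (level) {
    case EH_ABORT_TASK:      return EH_SCOPE_COMMAND;
    case EH_ABORT_TASK_SET:  return EH_SCOPE_LUN;
    case EH_TERMINATE_NEXUS: return EH_SCOPE_IT_NEXUS;
    default:                 return EH_SCOPE_HOST;
    }
}

/*
 * Walk the levels in order and return the first one that succeeded.
 * succeeded[l] stands in for "the action at level l completed within
 * its timeout"; a real implementation would issue the TMF and wait.
 */
static enum eh_level eh_escalate(const int succeeded[EH_NUM_LEVELS])
{
    assert(succeeded != NULL);
    for (int l = EH_ABORT_TASK; l < EH_NUM_LEVELS; l++) {
        if (succeeded[l])
            return (enum eh_level)l;
    }
    return EH_NUM_LEVELS;
}
```

With this shape, a failure that is only resolved by terminating the
I_T nexus never stops I/O beyond that nexus, and the host-wide scope
is reached only when the host reset itself is attempted.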

As usual, comments are welcome.

Cheers,

Hannes
