Re: [PATCH] scsi: Allow error handling timeout to be specified

Hannes Reinecke <hare@xxxxxxx> · Mon, 13 May 2013 07:46:45 +0200

On 05/10/2013 09:27 PM, Baruch Even wrote:
> On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>>
>>> The error handling I have in mind (admittedly, not fully thought out)
>>> should work for both FC and SAS. Currently the error recovery
>>> progresses at the host level regardless of if the errors are on one
>>> device or all of them, it also stops the IOs on all devices and LUNs.
>>> It would be nice if that was taken into account. My ideas may be more
>>> suitable to the environment I work in (enterprise storage devices
>>> rather than hosts) but I believe the same approach would benefit the
>>> hosts as well.
>>>
>>> It would be interesting to see what approach the new error handling will
>>> take.
>>>
>> So, my general idea is this:
>>
>> 1) Send command aborts from scsi_times_out(). There is no requirement
>>    on stopping I/O on the host simply because a single command times
>>    out. And as scsi_times_out() is run from a separate thread anyway
>>    we should be able to send ABORT TASK TMFs without a problem
>> 2) Modify recovery sequence.
>>    One of the major pitfalls of the current scsi_eh is that it
>>    spills over onto unrelated LUNs for higher levels. So for the
>>    new EH we should be using a sequence of
>>    - ABORT TASK
>>    - ABORT TASK SET
>>    - (Terminate I_T nexus)
>>    - (Host reset)
>>    'Terminate I_T nexus' for FibreChannel is equivalent to a LOGO.
>>    'Host reset' is the current host reset function.
>> 3) Finegrained recovery setting.
>>    There is no need to stop the entire host when doing a recovery;
>>    it should be sufficient to stop I/O to the unit
>>    (LUN, I_T nexus, host) when the error recovery is at the
>>    respective level.
> 
> This looks great and much in line with what I'm thinking.
> 
> What about not going to the higher level if not everything at that
> level had failed?
> I mean that if at the target not all LUNs failed it will be quite
> troublesome to other LUNs if I-T-Nexus is terminated and that at the
> host level if there are still targets that are functioning it will
> kill them too to reset the host.
> 

True. But at the end of the day, we _do_ want to recover the failed
LUN. If we were to disable that faulty LUN and continue running with
the others, we would never have a chance of recovering that one LUN.

Plus we have to keep in mind that the attempted error recovery may
have failed for totally unrelated reasons (e.g. sending an ABORT TASK
SET when the link is down). So we basically _have_ to escalate to
the next level, even though that will mean stopping I/O to other,
hitherto unaffected instances.
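
To make the escalation concrete, here is a rough sketch in plain C of
the ladder from my earlier mail. All names (eh_level, eh_scope,
eh_recover, eh_succeeds_from) are hypothetical and not part of the
existing scsi_eh code. The mapping of levels to the I/O scope that gets
stopped is my reading of point 3, with ABORT TASK stopping nothing as
per point 1.

```c
#include <stdbool.h>

/* Hypothetical recovery levels, in escalation order (not kernel API). */
enum eh_level {
	EH_ABORT_TASK,
	EH_ABORT_TASK_SET,
	EH_TERMINATE_IT_NEXUS,	/* a LOGO on FibreChannel */
	EH_HOST_RESET,
	EH_GIVE_UP,		/* every level was tried and failed */
};

/* Hypothetical scope of I/O that is stopped while a level runs. */
enum eh_scope { SCOPE_NONE, SCOPE_LUN, SCOPE_IT_NEXUS, SCOPE_HOST };

/* Assumed mapping: stop only the unit the current level acts on. */
enum eh_scope eh_scope_for(enum eh_level level)
{
	switch (level) {
	case EH_ABORT_TASK:		return SCOPE_NONE;
	case EH_ABORT_TASK_SET:		return SCOPE_LUN;
	case EH_TERMINATE_IT_NEXUS:	return SCOPE_IT_NEXUS;
	default:			return SCOPE_HOST;
	}
}

/* Callback that attempts one level; returns true if recovery worked. */
typedef bool (*eh_try_fn)(enum eh_level level, void *ctx);

/*
 * Walk the ladder from the bottom. A failed level always escalates,
 * because the failure may have an unrelated cause (e.g. link down).
 * Returns the level that succeeded, or EH_GIVE_UP if none did.
 */
enum eh_level eh_recover(eh_try_fn try_level, void *ctx)
{
	for (int level = EH_ABORT_TASK; level < EH_GIVE_UP; level++) {
		if (try_level((enum eh_level)level, ctx))
			return (enum eh_level)level;
	}
	return EH_GIVE_UP;
}

/* Toy callback for illustration: ctx names the first level that works. */
bool eh_succeeds_from(enum eh_level level, void *ctx)
{
	const enum eh_level *first_ok = ctx;

	return level >= *first_ok;
}
```

With the toy callback, a fault that only clears at the I_T nexus level
walks through ABORT TASK and ABORT TASK SET first, and only then stops
I/O to the other LUNs behind that nexus.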

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)