On 05/10/2013 09:27 PM, Baruch Even wrote:
> On Fri, May 10, 2013 at 11:18 PM, Hannes Reinecke <hare@xxxxxxx> wrote:
>> On 05/10/2013 07:51 PM, Baruch Even wrote:
>>>
>>> The error handling I have in mind (admittedly, not fully thought out)
>>> should work for both FC and SAS. Currently the error recovery
>>> progresses at the host level regardless of if the errors are on one
>>> device or all of them, it also stops the IOs on all devices and LUNs.
>>> It would be nice if that was taken into account. My ideas may be more
>>> suitable to the environment I work in (enterprise storage devices
>>> rather than hosts) but I believe the same approach would benefit the
>>> hosts as well.
>>>
>>> It would be interesting to see what approach the new error handling will
>>> take.
>>>
>> So, my general idea is this:
>>
>> 1) Send command aborts from scsi_times_out(). There is no requirement
>>    on stopping I/O on the host simply because a single command times
>>    out. And as scsi_times_out() is run from a separate thread anyway
>>    we should be able to send ABORT TASK TMFs without a problem
>> 2) Modify recovery sequence.
>>    One of the major pitfalls of the current scsi_eh is that it
>>    spills over onto unrelated LUNs for higher levels. So for the
>>    new EH we should be using a sequence of
>>    - ABORT TASK
>>    - ABORT TASK SET
>>    - (Terminate I_T nexus)
>>    - (Host reset)
>>    'Terminate I_T nexus' for FibreChannel is equivalent to a LOGO.
>>    'Host reset' is the current host reset function.
>> 3) Finegrained recovery setting.
>>    There is no need to stop the entire host when doing a recovery;
>>    it should be sufficient to stop I/O to the unit
>>    (LUN, I_T nexus, host) when the error recovery is at the
>>    respective level.
>
> This looks great and much in line with what I'm thinking.
>
> What about not going to the higher level if not everything at that
> level had failed?
> I mean that if at the target not all LUNs failed it will be quite
> troublesome to other LUNs if I-T-Nexus is terminated and that at the
> host level if there are still targets that are functioning it will
> kill them too to reset the host.
>
True. But at the end of the day, we _do_ want to recover the failed
LUN. If we were to disable that faulty LUN and continue running with
the others, we would never have a chance of recovering that one LUN.

Plus we have to keep in mind that the attempted error recovery may
have failed for totally unrelated reasons (i.e. sending an ABORT TASK
SET when the link is down).

So we basically _have_ to escalate to the next level, even though
that will mean stopping I/O to other, hitherto unaffected instances.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
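
To make the escalation order and the per-level I/O blocking discussed in
this thread concrete, here is a rough Python model. It is not kernel
code, and it is only one reading of items 2) and 3) above: the names
Level, BLOCKED_SCOPE, recover() and the attempt callback are all
hypothetical, and pairing ABORT TASK SET with the LUN scope is an
assumption on my part.

```python
from enum import IntEnum
from typing import Callable, List, Optional, Tuple


class Level(IntEnum):
    """Escalation steps, in the order proposed in the thread."""
    ABORT_TASK = 0
    ABORT_TASK_SET = 1
    TERMINATE_IT_NEXUS = 2
    HOST_RESET = 3


# Unit whose I/O is stopped while recovery runs at each level.
# ABORT TASK is sent from the timeout path, so nothing is stopped
# (assumed mapping, see the note above).
BLOCKED_SCOPE = {
    Level.ABORT_TASK: None,
    Level.ABORT_TASK_SET: "LUN",
    Level.TERMINATE_IT_NEXUS: "I_T nexus",
    Level.HOST_RESET: "host",
}


def recover(attempt: Callable[[Level], bool]) -> Tuple[Optional[Level], List[str]]:
    """Try each level in turn and escalate whenever one fails.

    `attempt` is a hypothetical callback that returns True when
    recovery at the given level succeeded. The result is the level
    that succeeded (None if even the host reset failed) plus the
    list of scopes whose I/O had to be stopped on the way there.
    """
    blocked: List[str] = []
    for level in Level:
        scope = BLOCKED_SCOPE[level]
        if scope is not None:
            blocked.append(scope)
        if attempt(level):
            return level, blocked
    return None, blocked
```

A failure that only clears once the I_T nexus is terminated would then
stop I/O to the LUN and the I_T nexus, but never to the whole host:
recover(lambda lvl: lvl >= Level.TERMINATE_IT_NEXUS) yields
(Level.TERMINATE_IT_NEXUS, ["LUN", "I_T nexus"]).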