Re: [PATCH] scsi: Allow error handling timeout to be specified

On 05/13/2013 04:40 PM, Jeremy Linton wrote:
> On 5/13/2013 12:46 AM, Hannes Reinecke wrote:
> 
>> True. But at the end of the day, we _do_ want to recover the failed LUN.
>> If we were to disable that faulty LUN and continue running with the others,
>> we wouldn't have a chance of _ever_ recovering that one LUN.
> 
> 	I don't buy this. Especially for FC devices, the vast majority of errors I see
> are related to zoning, SFP and cabling problems. Once one of those happens, you
> tend to get a lot of shotgun debugging, which injects all kinds of
> further errors. None of these errors are fixed by the Linux error recovery paths.
> 
> 	That said, if the admin fixes something, for FC/SAS (and potentially others)
> you _WILL_ get notification that the device is online again.
> 

Well, yes, of course.
Sadly, these kinds of errors tend to be very erratic and very hard to
diagnose. There is simply no way of telling whether the error you've had
is due to a bad cable or a bad SFP.
Bad zoning is easy; then the device is simply not reachable anymore.

So for error recovery we first have to assume that the error is
fixable. And then we have a standard way of trying to fix this error.
The problem we have is that we lose all information about the error
once it's 'fixed' (i.e. after EH is done). Which is the main problem
with bad cabling: we're running the same sequence all over again,
without ever figuring out 'hey, I've done this already'.

sd.c has some _very_ limited support for this. But trying to
generalise things here will be _hard_.

So yeah, I see your point. In fact, I've been bitten by this, too.
But the error scenarios I've seen are far too complex to have them
modelled into something re-usable.

>> SET when the link is down). So we basically _have_ to escalate it to the
>> next level. Even though that will mean stopping I/O to other, hitherto
>> unaffected instances.
> 
> 	And a single failure turns into performance bubbles and further errors on
> other devices. Particularly if the functional devices are stateful, and the
> error recovery mechanism isn't sufficiently intelligent about that state (see
> tape drives). Think about what happens when a marginal SFP on a target causes
> a device to repeatedly drop off and reappear at some random point in the future.
> 
> 
> 	Anyway, it is possible to make a determination about the topology and make
> decisions about the likelihood of any given portion being at fault. For
> example, if one LUN on a target has failed and the remainder continue to work,
> then it's unlikely that, if abort and LUN reset fail, anything higher up in
> the stack is going to succeed.
> 
Which is why I suggested 'ABORT TASK SET' instead of 'LUN reset'.
That will be restricted to the I_T_L nexus, and leave the rest of
the LUN alone (or so one hopes).
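
To make the 'levels' concrete: here is a rough sketch (hypothetical;
'mydrv' and its handlers are made-up names, not from any real driver)
of the escalation ladder as the midlayer exposes it to a LLDD via
scsi_host_template. Each callback further down the list widens the
scope of what gets disturbed:

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static int mydrv_eh_abort(struct scsi_cmnd *cmd)        { return SUCCESS; }
static int mydrv_eh_device_reset(struct scsi_cmnd *cmd) { return SUCCESS; }
static int mydrv_eh_target_reset(struct scsi_cmnd *cmd) { return SUCCESS; }
static int mydrv_eh_host_reset(struct scsi_cmnd *cmd)   { return SUCCESS; }

static struct scsi_host_template mydrv_template = {
	.name                    = "mydrv",
	/* 1. abort a single command (narrowest scope) */
	.eh_abort_handler        = mydrv_eh_abort,
	/* 2. per-LUN step: a LLDD can back this with LUN RESET or,
	 *    as suggested above, with ABORT TASK SET so that only
	 *    the I_T_L nexus is touched */
	.eh_device_reset_handler = mydrv_eh_device_reset,
	/* 3. whole target: every LUN behind that port is affected */
	.eh_target_reset_handler = mydrv_eh_target_reset,
	/* 4. big hammer: the entire HBA, all I_T nexuses */
	.eh_host_reset_handler   = mydrv_eh_host_reset,
};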

> 	I feel pretty strongly that, at that point, you're better off providing good
> diagnostics about the failure and expecting user interaction rather than
> muddying the waters by causing other device interruptions. If the user tries
> everything and determines that a HBA reset is the right choice, provide that
> option, don't do it for them.
> 
> 	If every device attached to the HBA fails then resetting the HBA is a valid
> choice, not before. Same for I_T.
> 
Hmm. Really not sure.

Take the 'target not responding' case (which is what triggered this
whole issue anyway). Say a target port went out to lunch and doesn't
respond to FC commands anymore.

With our current EH it'll take _ages_, but eventually the big hammer
hits (or the device comes back) and everything is back to normal again.
So LUN reset (or ABORT TASK SET) fails.
The other LUNs haven't reported an error. But how do you know
whether they are still okay? The other LUNs might simply be idle,
and no commands have been sent to them.
So the state's still good. Do we reset the I_T nexus or not?

If we do, we would find that the entire rport doesn't respond, so
the devloss_tmo mechanism would trigger, and eventually the rport
will disappear and we're back on normal operation.

If we don't, the LUN will be stuck until someone actually
issues I/O to the other LUNs for that rport. And only once I/O has been
issued to the _last_ LUN will we decide to reset the I_T nexus.
Not a very appealing scenario.

And 'reset I_T nexus' should be a rather fast operation; with a bit
of luck the other rports wouldn't even notice.
I've had a prototype running which would just kick off the
dev_loss_tmo mechanism; that worked like a charm.
(Agreed, as James Smart indicated 'only by luck', but nevertheless)
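
Purely for illustration (this is not the actual prototype; it just
fills in the hypothetical target-reset callback from the sketch above),
the idea boils down to handing the failure to the FC transport instead
of escalating any further:

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_transport_fc.h>

static int mydrv_eh_target_reset(struct scsi_cmnd *cmd)
{
	struct fc_rport *rport = starget_to_rport(scsi_target(cmd->device));

	if (!rport)
		return FAILED;

	/* Block the rport and start the dev_loss_tmo timer; if the
	 * port doesn't come back in time the transport tears it down,
	 * and no other rport on the host gets touched. */
	fc_remote_port_delete(rport);
	return SUCCESS;
}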

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)