Re: [PATCH] scsi device recovery

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Wed, 12 Dec 2007 10:59:36 -0500

On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > below is a patch introducing device recovery, trying to prevent i/o
> > > errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> >
> > Why doesn't the regular scsi_eh do what you need?
> 
> First of all, it is presently simply not called when the two errors above do 
> happen. This could be changed, of course.

Erm, I think you'll find the error handler does activate on
DID_SOFT_ERROR.  It causes a retry via the eh.  DID_NO_CONNECT is an
immediate error with no eh intervention because it means that the target
went away.  Handling this as a retryable error isn't an option because
it will interfere with hotplug.
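
A minimal sketch of the distinction drawn above, written as standalone C
rather than kernel code. All names here (host_code, disposition,
cmd_state, decide) are hypothetical and defined locally for illustration.
The real host-byte definitions live in include/scsi/scsi.h, and the
bounded retry count is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Host-byte outcomes, defined locally for this sketch only. */
enum host_code { HOST_OK, HOST_NO_CONNECT, HOST_SOFT_ERROR };

/* Hypothetical dispositions: finish, retry, or fail the command. */
enum disposition { DISP_COMPLETE, DISP_RETRY, DISP_FAIL };

struct cmd_state {
    enum host_code host;  /* what the low-level driver reported */
    int retries;          /* retries consumed so far */
    int allowed;          /* retry budget (assumed to be bounded) */
};

enum disposition decide(struct cmd_state *cmd)
{
    assert(cmd != NULL);

    switch (cmd->host) {
    case HOST_OK:
        return DISP_COMPLETE;
    case HOST_NO_CONNECT:
        /* Target went away: fail at once so hotplug sees the removal. */
        return DISP_FAIL;
    case HOST_SOFT_ERROR:
        /* Transient error: retry while the budget lasts. */
        if (cmd->retries < cmd->allowed) {
            cmd->retries++;
            return DISP_RETRY;
        }
        return DISP_FAIL;
    }
    return DISP_FAIL;
}
```

In this model a no-connect result never consumes a retry, which is why
treating it as retryable would delay the point at which a removed target
is noticed.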

> Secondly, I think scsi_eh is in most cases doing too much. We are fighting 
> with flaky Infortrend boxes here, and scsi_eh sometimes manages to crash 
> their scsi channels. In most cases it is sufficient to stall any io to the 
> device and then to resume.

But that's basically the default behaviour of the error handler (stall
then resume).

> For most scsi devices one probably doesn't need a suspend time, or it can be
> very small; this still needs to become configurable via sysfs.

You mean a wait time beyond what the error handler currently does
(basically it waits for the quiesce, begins error handling and then
sends a test unit ready when it finishes before restarting)?
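
The sequence described in the parentheses can be sketched as a small
state machine. This is an illustrative model, not the kernel's error
handler: eh_step, host_model and eh_next are hypothetical names, and the
EH_GIVE_UP outcome on a failed TEST UNIT READY is an assumption of the
sketch (the real handler has further recovery stages).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step names for this sketch; not kernel identifiers. */
enum eh_step {
    EH_QUIESCE,  /* wait for outstanding commands to drain */
    EH_RECOVER,  /* run the recovery actions */
    EH_TUR,      /* probe the device with TEST UNIT READY */
    EH_RESTART,  /* device answered: resume normal I/O */
    EH_GIVE_UP   /* assumption: stop rather than loop forever */
};

struct host_model {
    int in_flight;  /* commands still outstanding */
    int tur_ok;     /* nonzero if the device answers TEST UNIT READY */
};

/* Return the step that follows 'cur' for the given host state. */
enum eh_step eh_next(enum eh_step cur, const struct host_model *h)
{
    assert(h != NULL);

    switch (cur) {
    case EH_QUIESCE:
        /* Stall: stay here until nothing is in flight. */
        return h->in_flight > 0 ? EH_QUIESCE : EH_RECOVER;
    case EH_RECOVER:
        return EH_TUR;
    case EH_TUR:
        return h->tur_ok ? EH_RESTART : EH_GIVE_UP;
    case EH_RESTART:
    case EH_GIVE_UP:
        return cur;  /* terminal states */
    }
    return EH_GIVE_UP;
}
```

Both terminal states are absorbing, so a walk that starts at EH_QUIESCE
ends once the in-flight count reaches zero.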

> Thirdly, scsi_eh doesn't give up. In most cases, when the scsi channel of an
> Infortrend box crashed, it tried forever to recover.
> To improve this is still on my todo list.

Could you send traces for this?  I thought the error handler had been
fixed over the last few years so that it always terminates.  If there's
a case where it doesn't, this needs fixing.

James
