On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > below is a patch introducing device recovery, trying to prevent i/o
> > > errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> >
> > Why doesn't the regular scsi_eh do what you need?
>
> First of all, it is presently simply not called when the two errors above do
> happen. This could be changed, of course.

Erm, I think you'll find the error handler does activate on
DID_SOFT_ERROR.  It causes a retry via the eh.  DID_NO_CONNECT is an
immediate error with no eh intervention because it means that the target
went away.  Handling this as a retryable error isn't an option because
it will interfere with hotplug.

> Secondly, I think scsi_eh is in most cases doing too much. We are fighting
> with flaky Infortrend boxes here, and scsi_eh sometimes manages to crash
> their scsi channels. In most cases it is sufficient to stall any io to the
> device and then to resume.

But that's basically the default behaviour of the error handler (stall
then resume).

> For most scsi devices one probably doesn't need a suspend time or it can be
> very small, this still needs to become configurable via sysfs.

You mean a wait time beyond what the error handler currently does?
(Basically it waits for the quiesce, begins error handling and then
sends a test unit ready when it finishes before restarting.)

> Thirdly, scsi_eh doesn't give up, in most cases, when the scsi channel of a
> Infortrend box crashed, it tried forever to recover.
> To improve this is still on my todo list.

Could you send traces for this?  I thought the error handler had been
fixed over the last few years always to terminate.  If there's a case
where it doesn't, this needs fixing.
James