On Wednesday 03 December 2008 17:29:22 James Bottomley wrote:
> On Wed, 2008-12-03 at 17:00 +0100, Bernd Schubert wrote:
> > On Wednesday 03 December 2008 16:16:31 James Bottomley wrote:
> > > On Wed, 2008-12-03 at 13:17 +0100, Bernd Schubert wrote:
> > > > On Wednesday 26 November 2008 19:47:47 James Bottomley wrote:
> > > > > On Wed, 2008-11-26 at 18:46 +0100, Bernd Schubert wrote:
> > > > > > Activate the error handler if DID_SOFT_ERROR failed too often, but
> > > > > > only for commands which have scmd->allowed > 1.
> > > > > > Also make a function out of a goto block.
> > > > >
> > > > > What is the rationale for this? It really doesn't look right, since
> > > > > DID_SOFT_ERROR is supposed to be for temporary out-of-resource
> > > > > conditions in the HBA driver ... activating the error handler isn't
> > > > > really going to fix this, because the eh is taking us through a
> > > > > state model for device conditions, which DID_SOFT_ERROR shouldn't
> > > > > be.
> > > >
> > > > What do you suggest instead? Just returning an I/O error without
> > > > even trying to recover the device isn't nice.
> > >
> > > It doesn't do that ... it retries up to the retry limit before failing
> > > the command. There is an argument that we should treat this as other
> > > temporary resource conditions like BUSY and QUEUE_FULL, so return
> > > ADD_TO_MLQUEUE. On the other hand, DID_REQUEUE already does that, so
> > > this would lose the only unconditional DID_ code going generically
> > > through the retry path.
> > >
> > > > > If you just need a DID_FAIL to activate the eh, it can be added
> > > > > without changing the meaning of DID_SOFT_ERROR.
> > > > >
> > > > > Also, you changed the return to make it device blocking (which also
> > > > > doesn't look right) but didn't document that in the change log.
> > > >
> > > > Last year you suggested to switch from NEEDS_RETRY to ADD_TO_MLQUEUE:
> > > >
> > > > http://www.mail-archive.com/linux-scsi%40vger.kernel.org/msg12475.html
> > > >
> > > > When I wrote the patch documentation, I had already forgotten about
> > > > it, sorry. Unfortunately, it didn't help much for our devices. So I
> > > > made it activate the eh only if it fails too often. With the eh
> > > > activated, devices sometimes can be recovered. But I'm certainly
> > > > grateful for any hints to further improve recovery and to prevent
> > > > I/O errors.
> > >
> > > Well, what exactly is the problem? Changing to ADD_TO_MLQUEUE will
> > > retry intermittently up to the command timeout. If activating the
> > > error handler actually fixes the problem, then the driver was probably
> > > wrongly returning DID_SOFT_ERROR.
> >
> > Well, I certainly have no experience with hardware/driver programming,
> > but all the drivers I have looked at seem to use DID_SOFT_ERROR as
> > something like DID_UNKNOWN_ERROR.
>
> DID_SOFT_ERROR means specifically that the driver ran into a resource or
> other soft (as in retryable) error.
>
> > I certainly do not insist on using ADD_TO_MLQUEUE instead of
> > NEEDS_RETRY, and I will happily modify the patch if you think
> > NEEDS_RETRY is better. But I would really prefer to try to recover the
> > device when DID_SOFT_ERROR comes up. I mean, without the eh we get an
> > I/O error anyway, so a last attempt at some device resets won't hurt,
> > will it?
>
> Yes, it will on a SAN. For an error condition internal to the driver,
> what is the point of causing external disruptions to the devices?

Hrm, right. For a SAN it is really a problem :(

> > And I have really seen some successful mpt fusion device recoveries.
>
> Well, my guess there would be that the internal sequencer is hosed and
> the host reset corrected it, is that right? In which case, there might
> be a place where mpt fusion is returning DID_SOFT_ERROR incorrectly.
Well, these DID_SOFT_ERRORs are not easily reproducible. If they were, I
would already have debugged it in more detail.

Any idea what we can do?

Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH