Re: [PATCH] scsi device recovery

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Fri, 14 Dec 2007 09:35:01 -0500

On Fri, 2007-12-14 at 13:04 +0100, Bernd Schubert wrote:
> Hello James,
> 
> On Thursday 13 December 2007 15:18:33 James Bottomley wrote:
> > On Wed, 2007-12-12 at 18:54 +0100, Bernd Schubert wrote:
> > > [Hmm, resending since mail after more than 30min still not on the ML,
> > > maybe the attachment was too large? I have uploaded the log to
> > > http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/scsi/kern.log.1]
> > >
> > > On Wednesday 12 December 2007 16:59:36 James Bottomley wrote:
> > > > On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> > > > > On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > > > > > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > > > > > below is a patch introducing device recovery, trying to prevent
> > > > > > > i/o errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> > > > > >
> > > > > > Why doesn't the regular scsi_eh do what you need?
> > > > >
> > > > > First of all, it is presently simply not called when the two errors
> > > > > above do happen. This could be changed, of course.
> > > >
> > > > Erm, I think you'll find the error handler does activate on
> > > > DID_SOFT_ERROR.  It causes a retry via the eh.  DID_NO_CONNECT is an
> > >
> > > Dec  7 23:48:45 beo-96 kernel: [94605.297924] sd 2:0:5:0: [sdd] Result:
> > > hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> > > Dec  7 23:48:45 beo-96 kernel: [94605.297932] end_request: I/O error, dev
> > > sdd, sector 7706802052
> > > Dec  7 23:48:45 beo-96 kernel: [94605.297937] raid5:md5: read error not
> > > correctable (sector 871932472 on sdd3).
> >
> > This is some type of ioc internal error.  What we do on DID_SOFT_ERROR
> > is retry for the usual number of times up to the timeout limit.
> > Unfortunately, the retries are fixed at SD_MAX_RETRIES in sd.c.  Without
> > diagnosing what's going wrong in the fusion, it's impossible to say if
> > this is reasonable, but your fusion is signalling ioc errors (firmware
> > errors).
> 
> besides this seems to be a fusion driver or firmware problem, I still think eh 
> is not activated for this error. I'm not absolutely sure, but I think with my
> patch deh and later on eh would be triggered, wouldn't it?

The full eh machinery, by design, isn't activated for a simple retry.
If you look in scsi_lib.c:scsi_softirq_done() you'll see the processing
of the outcome of scsi_decide_disposition() (DID_SOFT_ERROR comes out of
here with NEEDS_RETRY, providing there are retries left).  Right at the
moment, this means that the retry is absolutely immediate, so you
probably run through all of the retries before firmware recovery even
has time to activate.  I'd be amenable to giving it an ADD_TO_MLQUEUE
type return (provided it still increments retries) which will cause a
pause in the resubmission (until either a command returns or io pressure
builds up in the block layer).
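
To illustrate the timing argument above, here is a minimal toy model in C.
It is a sketch under stated assumptions, not the mid-layer code: every toy_*
name, the tick-based clock, and the "device recovers at tick N" behaviour
are hypothetical and only mimic the shape of the disposition logic.  It
shows that with zero delay between retries the retry budget is used up
before the simulated firmware recovers, while a pause between resubmissions
lets the same budget outlast the recovery.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names mimic, but are not, the kernel's definitions. */
enum toy_host_byte { TOY_DID_OK, TOY_DID_NO_CONNECT, TOY_DID_SOFT_ERROR };
enum toy_disposition { TOY_SUCCESS, TOY_NEEDS_RETRY, TOY_FAILED };

struct toy_cmd {
    int retries;   /* retries consumed so far */
    int allowed;   /* retry budget for this command */
};

/* Decide what to do with a completed command, given its host byte. */
static enum toy_disposition toy_decide(struct toy_cmd *cmd,
                                       enum toy_host_byte hb)
{
    switch (hb) {
    case TOY_DID_OK:
        return TOY_SUCCESS;
    case TOY_DID_NO_CONNECT:
        return TOY_FAILED;          /* target gone: fail at once, no retry */
    case TOY_DID_SOFT_ERROR:
        if (cmd->retries < cmd->allowed) {
            cmd->retries++;         /* every resubmission consumes a retry */
            return TOY_NEEDS_RETRY;
        }
        return TOY_FAILED;          /* budget exhausted: I/O error */
    }
    return TOY_FAILED;
}

/* Hypothetical device: reports soft errors until tick recover_at. */
static enum toy_host_byte toy_issue(int now, int recover_at)
{
    return now < recover_at ? TOY_DID_SOFT_ERROR : TOY_DID_OK;
}

/*
 * Run one command to completion.  Each retry is resubmitted retry_delay
 * ticks after the failure (0 models an immediate retry).  Returns true
 * if the command eventually succeeds.
 */
static bool toy_run(int allowed, int retry_delay, int recover_at)
{
    struct toy_cmd cmd = { 0, allowed };
    int now = 0;

    assert(allowed >= 0 && retry_delay >= 0);
    for (;;) {
        enum toy_disposition d =
            toy_decide(&cmd, toy_issue(now, recover_at));

        if (d != TOY_NEEDS_RETRY)
            return d == TOY_SUCCESS;
        now += retry_delay;         /* pause before resubmission */
    }
}
```

In this model toy_run(5, 0, 10) fails, because all five retries happen at
tick 0, whereas toy_run(5, 3, 10) succeeds on the attempt made at tick 12.
The numbers are arbitrary; only the relation between delay, budget and
recovery time matters.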

> >
> > > Full log attached.
> > >
> > > > immediate error with no eh intervention because it means that the
> > > > target went away.  Handling this as a retryable error isn't an option
> > > > because it will interfere with hotplug.
> > >
> > > Then we need a sysfs flag one can set to manually enable eh for these
> > > devices on DID_NO_CONNECT.
> >
> > No, because that will seriously damage a lot of other systems.
> 
> How would it, if we create a device specific sysfs parameter defaulting to 
> off? If you think users could activate it by accident, we could also print a 
> big warning when the parameter is read from userspace.
> Furthermore, as far as I did understand you, DID_NO_CONNECT is only required 
> for hotplugging. But real scsi doesn't do automatic hotplugging, does it? 

Yes, it does.  Most modern busses are hot plug aware and use
DID_NO_CONNECT to signal target went away.  Even some SPI frames are
quasi hotplug aware.

> One 
> always needs to do it manually, e.g. with scsiadd or similar tools. So is 
> DID_NO_CONNECT really required for native scsi? If not, we also could make 
> the scsi-drivers to set a flag to activate eh on DID_NO_CONNECT.

Just grep through the mid layer ... you'll see we use DID_NO_CONNECT on
a host of other error conditions to force an immediate error as well.

> >
> > The DID_NO_CONNECT looks to be a genuine reselection issue caused by a
> > device out of spec on the bus.  The SPI standard says a device should
> > respond in 250ms, which is what most HBAs take as the default selection
> > timeout.  I'd say for the device you have, you need to increase this.
> > Unfortunately doing this for the fusion is some type of mode page
> > setting, I think, but I don't have the doc in front of me.  I'd be
> > amenable to putting the selection timeout as a parameter in the spi
> > transport class, since others might find it valuable occasionally to
> > control.
> 
> It's of course optimal to fix the real cause of our problems. I have now
> asked Infortrend which value should be used for their devices.
> 
> Eric, I would be grateful if you could point me to the code fragment using or
> setting the respond timeout.
> 
> 
> [...]
> 
> > > I'm attaching the syslog, this is 2.6.22 + additional printks,
> > > dump_stack()'s and msleep()'s.
> > > At 03:59:36 the system finally went into wait_for_completion(), similar
> > > to the "everything in wait_for_completion, what is my system doing?"
> > > thread.
> >
> > This looks like a genuine bug.  I missed the thread, since my email
> > system went off line while I was on holiday for two weeks.  The symptoms
> > look to be lost commands, but I can't see why from the traces.  There's
> > a known bug where we can hang in domain validation because of a resource
> > starvation issue, but I know of none where everything hangs just after
> > error recovery completes.
> 
> Since not much has happened yet to solve this bug, shall I create a bugzilla
> entry?

Sure ... on further analysis, it is the fusion DV resource starvation
issue.  The email thread is here:

http://marc.info/?t=118039577800004

James
