Re: disk restart failure after suspend

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Mon, 19 Oct 2009 09:42:04 -0400 (EDT)

On Sun, 18 Oct 2009, Stefan Richter wrote:

> > It seems to me that the restart always fails if the "rediscovered
> > device fw1" resp.  "firewire_sbp2: fw1.0: reconnected to LUN 0000"
> > message comes after the "[sdb] Starting disk" message.  That would
> > sound like an actual bug to me.
> 
> It is not a bug.  IEEE 1394 rediscovery and SBP-2 reconnect can become
> necessary anytime (and they do become necessary at /least/ once during
> PM resume), in no particular order with respect to SCSI request
> submission.  Our drivers (firewire-sbp2 mainly) need to be able to
> handle any order of such events.

Is it possible to delay returning from the device resume routine until
the rediscovery/reconnect has completed?  This is more or less how the
USB stack works.
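
To make the idea concrete, here is a rough userspace sketch, not kernel code. Every name in it (fake_transport, fake_transport_ready, fake_resume) is hypothetical, and the polling loop is only a stand-in: a real driver would more likely sleep until the reconnect handler signals it, rather than spin.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport such as an SBP-2 target. */
struct fake_transport {
	int polls_until_ready;	/* simulated reconnect finishes after this many polls */
};

/* Returns true once the simulated rediscovery/reconnect has finished. */
static bool fake_transport_ready(struct fake_transport *t)
{
	if (t->polls_until_ready > 0) {
		t->polls_until_ready--;
		return false;
	}
	return true;
}

/*
 * Model of a resume routine that does not return success until the
 * transport is usable again.  It gives up after max_polls attempts so
 * that a dead device cannot stall the resume forever.
 */
static int fake_resume(struct fake_transport *t, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (fake_transport_ready(t))
			return 0;	/* safe to let upper layers submit requests */
	}
	return -ETIMEDOUT;
}
```

With this ordering, anything that runs after the resume routine returns 0 can assume the reconnect has already happened.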

> Interesting findings.
> 
> There are two independent places of the code that could possibly be
> improved to fix this issue:
> 
> a.)  sd's PM resume method:
> 
> 1.a)  sd_resume could gain this retry loop which you implemented.

This wouldn't be necessary if the transport were already working by the
time sd_resume got called.
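
For reference, a retry loop of the kind discussed in 1.a) could look roughly like the sketch below. This is not the patch from this thread. It is a generic userspace model in which the callback type start_fn stands in for sd_start_stop_device, and start_with_retries and flaky_start are hypothetical names. A real loop would presumably also sleep between attempts.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback standing in for the "start the disk" request. */
typedef int (*start_fn)(void *ctx);

/* Call start() up to max_tries times; return 0 on the first success,
 * otherwise the error from the last attempt. */
static int start_with_retries(start_fn start, void *ctx, int max_tries)
{
	int err = -EINVAL;

	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		err = start(ctx);
		if (err == 0)
			break;
	}
	return err;
}

/* Demo helper: fails with -EIO until the counter in ctx reaches zero,
 * imitating a transport that has not reconnected yet. */
static int flaky_start(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -EIO;
	}
	return 0;
}
```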

> 1.b)  sd_resume (but probably not sd_suspend) could optimistically
> ignore any error return from sd_start_stop_device.  If the motor cannot
> be started immediately at resume, the SCSI core would try to start it
> later on when the disk is normally accessed.

This is probably a worthwhile idea in any case.
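
A minimal sketch of that behaviour, again as a userspace model and not the actual sd code: the resume path logs a failed start and returns success anyway, on the assumption that the disk will be started later when it is first accessed. The names spinup_fn, resume_ignoring_start_error, always_fail and always_ok are hypothetical.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical callback standing in for the "start the disk" request. */
typedef int (*spinup_fn)(void *ctx);

/* Try to start the disk at resume time, but never fail the resume
 * because of it: report the error and carry on. */
static int resume_ignoring_start_error(spinup_fn start, void *ctx)
{
	int err = start(ctx);

	if (err)
		fprintf(stderr, "start failed (%d), deferring to first access\n", err);
	return 0;
}

/* Demo helpers modelling a start request that fails or succeeds. */
static int always_fail(void *ctx) { (void)ctx; return -EIO; }
static int always_ok(void *ctx) { (void)ctx; return 0; }
```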

> My assumption here is that an error return from sd_resume causes the
> disk to become inaccessible (taken offline?).

No.  All it does is cause an error message to be printed in the system 
log.  But it's possible that a failure lower down in the SCSI stack has 
this effect.

Alan Stern
