Re: [PATCH 1/5] SCSI scanning and removal fixes

James Bottomley <James.Bottomley@xxxxxxxxxxxx> · Thu, 08 Sep 2005 15:15:41 -0500

On Thu, 2005-09-08 at 14:58 -0400, Alan Stern wrote:
> On Thu, 8 Sep 2005, James Bottomley wrote:
> > Actually, no, that's why we have the parallel EH states ... let me put
> > in the events that trigger state transitions so you can see what
> > happens:
> > 
> > 
> >         EH thread finishes
> > 	<---------------
> > 
> >         EH thread begins
> >         ---------------->
> > 
> > 
> >          <--------- 
> >   running ---------> recovery
> >     |                   |
> >     |                   | scsi_remove_host() called
> >     v   <----------     v
> >   cancel ----------> recover/cancel
> >     |                   |
> >     |                   | scsi_remove_host finishes visibility removal
> >     v                   v
> >    del <------------ recover/del
> > 
> > So the EH is allowed to activate in either running or cancel states, but
> > goes through its own state transition eventually coming back to del when
> > it finishes.  Once the EH gets into recover/cancel it can never
> > transition back to running.
> 
> Why allow cancel -> recover/cancel?  Once the device is in the cancel 
> state, there isn't anything useful the error handler can accomplish, is 
> there?  Failed or timed-out commands should simply return an error.
> 
> And if you accept that, then what point is there in distinguishing between 
> cancel and recover/cancel?  As far as I can see, the only significant 
> difference is that in recover/cancel the error handler is running (but not 
> accomplishing anything).  Is this related to 1) below?

Yes, a state machine like I've shown can only be in one state at any
given time.  The problem is what happens if the error handler is running
when you try to remove the device.  We can reach in and try to stop it,
but under a state model we'd have to wait until the error handler put
the device back to running unless we allow a set of state transitions
that the error handler can go through in parallel with the removal
(which is the model I've shown above).

The idea of allowing the error handler to start up in cancel is that not
every device removal is a forced removal, so you want to permit the
shutdown commands (like sync cache) and consequent error handling on
those commands.

> > It's just nasty on two counts:
> > 
> > 1) we have an incorrect bifurcation in the state model and
> > 2) we never actually enforce any of the state transitions.
> 
> Can you explain 1) more fully?  I don't really understand what you're 
> getting at.

Does the explanation above cover this?

> As for 2), what do you mean?  In 2.6.13, scsi_device_set_state does not 
> change the state if the transition is illegal.

Yes, but we never act on the return value.  if the state is illegal, we
shouldn't be taking actions believing we're in that state, so we should
always be checking the returns.

James

-
: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html