On Tue, 2008-03-11 at 17:43 +0100, Heiko Carstens wrote:
> > --- a/drivers/s390/scsi/zfcp_erp.c	2008-03-10 12:34:48.000000000 +0100
> > +++ b/drivers/s390/scsi/zfcp_erp.c	2008-03-10 16:05:26.000000000 +0100
> > @@ -1002,7 +1002,7 @@ zfcp_erp_thread_setup(struct zfcp_adapte
> >  					 &adapter->status));
> >  		debug_text_event(adapter->erp_dbf, 5, "a_thset_ok");
> >  	}
> > -
> > +	atomic_set_mask(ZFCP_STATUS_ADAPTER_ERP_ACCEPT, &adapter->status);
> >  	return (retval < 0);
>
> This way the flag will be set even if the creation of the erp thread failed.

You are right. This used to be correct with all the cleanups pending in
our queue applied. Looks like this flaw crept in when the bug fix patch
was moved to the head of the queue.

> It should be done somewhere within zfcp_erp_thread() instead.

Setting the flag in the setup function is fine, as long as we make sure
the thread is really up and running. I'd prefer to have it here because
of its symmetry with zfcp_erp_thread_kill().

> Besides that setting the flag must be done while holding the erp_lock.
> Otherwise zfcp_erp_action_enqueue might check the flag, other cpu clears
> the flag right after that, and then enqueueing continues while the flag is
> not set.

Correct. zfcp_erp_wait() might then fail to wait for that new action to
finish, because the PENDING flag is set later in the action enqueuing
process. But we want zfcp_erp_wait() to make sure that recovery has
settled, so that we can go on and kill the thread.

> >  }
> >  
> > @@ -1025,6 +1025,8 @@ zfcp_erp_thread_kill(struct zfcp_adapter
> >  {
> >  	int retval = 0;
> >  
> > +	atomic_clear_mask(ZFCP_STATUS_ADAPTER_ERP_ACCEPT, &adapter->status);
> > +	zfcp_erp_wait(adapter);
> >  	atomic_set_mask(ZFCP_STATUS_ADAPTER_ERP_THREAD_KILL, &adapter->status);
> >  	up(&adapter->erp_ready_sem);
> >  
> > @@ -2940,8 +2942,7 @@ zfcp_erp_action_enqueue(int action,
> >  	 * efficient.
> >  	 */
> >  
> > -	if (!atomic_test_mask(ZFCP_STATUS_ADAPTER_ERP_THREAD_UP,
> > -			      &adapter->status))
> > +	if (!atomic_test_mask(ZFCP_STATUS_ADAPTER_ERP_ACCEPT, &adapter->status))
> >  		return -EIO;
> >  
> >  	debug_event(adapter->erp_dbf, 4, &action, sizeof (int));
> > --- a/drivers/s390/scsi/zfcp_ccw.c	2008-03-10 12:34:48.000000000 +0100
> > +++ b/drivers/s390/scsi/zfcp_ccw.c	2008-03-10 16:05:26.000000000 +0100
> > @@ -198,7 +198,6 @@ zfcp_ccw_set_offline(struct ccw_device *
> >  	down(&zfcp_data.config_sema);
> >  	adapter = dev_get_drvdata(&ccw_device->dev);
> >  	zfcp_erp_adapter_shutdown(adapter, 0);
> > -	zfcp_erp_wait(adapter);
> >  	zfcp_erp_thread_kill(adapter);
>
> And as a sidenote, if the scenario this patch is supposed to fix can really
> happen, how can you make sure the adapter is still down before the erp thread
> gets killed? Some action could have been enqueued and finished which reopened
> the adapter after zfcp_erp_adapter_shutdown and before zfcp_erp_thread_kill.
> Just wondering... :)

You're acting the innocent... :)

As to question one - is there a way for recovery to be triggered while we
are about to kill the recovery thread during the adapter offlining
procedure: I am not sure. There are many potential triggers for error
recovery. Some code paths might be blocked once an adapter has been shut
down (zfcp_fsf.c). I could review each one of them, and maybe add a
comment about the assumption in zfcp_ccw.c. But I don't like the idea of
someone adding another recovery trigger somewhere in the driver in the
future, not taking heed of this assumption. That is why I'd rather have
the error recovery shutdown procedure be airtight and self-contained in
the first place, and not have it rely on shaky assumptions about the code
that uses it.
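To make that concrete, this is roughly the kill path I have in mind, with
the flag update moved under the erp_lock as you suggest. This is an
untested sketch that assumes zfcp_erp_action_enqueue() runs with erp_lock
held (which is what your remark about the race implies); error handling
and debug traces are left out:

static int
zfcp_erp_thread_kill(struct zfcp_adapter *adapter)
{
	unsigned long flags;
	int retval = 0;

	/* Stop accepting new erp actions. Taking erp_lock here means a
	 * concurrent zfcp_erp_action_enqueue() either finishes enqueueing
	 * before we drop the lock, or sees the flag cleared and returns
	 * -EIO. */
	write_lock_irqsave(&adapter->erp_lock, flags);
	atomic_clear_mask(ZFCP_STATUS_ADAPTER_ERP_ACCEPT, &adapter->status);
	write_unlock_irqrestore(&adapter->erp_lock, flags);

	/* Everything that made it into the queue is now visible to
	 * zfcp_erp_wait(), so wait for recovery to settle. */
	zfcp_erp_wait(adapter);

	/* No new actions can show up anymore, so the thread can be told
	 * to exit. */
	atomic_set_mask(ZFCP_STATUS_ADAPTER_ERP_THREAD_KILL, &adapter->status);
	up(&adapter->erp_ready_sem);

	/* ... wait for ZFCP_STATUS_ADAPTER_ERP_THREAD_UP to be cleared,
	 * as before ... */

	return retval;
}

With that in place, the explicit zfcp_erp_wait() in zfcp_ccw_set_offline()
becomes redundant, which is why the patch removes it there.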
As to question two - do we know for sure the adapter won't be brought
back up again? With or without my patch, the issue would be there.
Actually, my patch is useless as long as it only concerns itself with
bringing recovery to a halt, while disregarding the fact that recovery
needs to be brought to a halt in a way that guarantees that adapter
operation has been stopped.

Thank you,
Martin
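P.S. For symmetry, the setup side would then set the flag only once the
thread is known to be up, again under the erp_lock. Again just a rough,
untested sketch; the wait queue (erp_thread_wqh), the kernel_thread()
invocation and the surrounding error path are written from memory and may
not match the code exactly:

static int
zfcp_erp_thread_setup(struct zfcp_adapter *adapter)
{
	unsigned long flags;
	int retval;

	retval = kernel_thread(zfcp_erp_thread, adapter, SIGCHLD);
	if (retval < 0) {
		/* ... error path as before; ERP_ACCEPT stays cleared ... */
	} else {
		/* wait until the thread has really started */
		wait_event(adapter->erp_thread_wqh,
			   atomic_test_mask(ZFCP_STATUS_ADAPTER_ERP_THREAD_UP,
					    &adapter->status));
		debug_text_event(adapter->erp_dbf, 5, "a_thset_ok");

		/* only now start accepting erp actions, under erp_lock
		 * for the same reason as in the kill path */
		write_lock_irqsave(&adapter->erp_lock, flags);
		atomic_set_mask(ZFCP_STATUS_ADAPTER_ERP_ACCEPT,
				&adapter->status);
		write_unlock_irqrestore(&adapter->erp_lock, flags);
	}

	return (retval < 0);
}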