Re: [PATCH v6 9/9] block, scsi: Make SCSI device suspend and resume work reliably

Bart Van Assche <Bart.VanAssche@xxxxxxx> · Mon, 9 Oct 2017 21:16:53 +0000

On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote:
> On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote:
> > It is essential during suspend and resume that neither the filesystem
> > state nor the filesystem metadata in RAM changes. This is why while
> > the hibernation image is being written or restored that SCSI devices
> 
> quiesce isn't used only for suspend and resume, And the issue isn't
> suspend/resume specific too. So please change the title/commit log
> as sort of 'make SCSI quiesce more reliable/safe'. 

OK, I will modify the patch title.

> > -		if (percpu_ref_tryget_live(&q->q_usage_counter))
> > -			return 0;
> > +		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> > +			/*
> > +			 * The code that sets the PREEMPT_ONLY flag is
> > +			 * responsible for ensuring that that flag is globally
> > +			 * visible before the queue is unfrozen.
> > +			 */
> > +			if (preempt || !blk_queue_preempt_only(q)) {
> 
> PREEMPT_ONLY flag is checked without RCU read lock held, so the
> synchronize_rcu() may just wait for completion of pre-exit
> percpu_ref_tryget_live(), which can be reordered with the
> reading on blk_queue_preempt_only().

OK, I will add rcu_read_lock_sched/unlock_sched() calls around this code.

> >  int
> >  scsi_device_quiesce(struct scsi_device *sdev)
> >  {
> > +	struct request_queue *q = sdev->request_queue;
> >  	int err;
> >  
> > +	blk_mq_freeze_queue(q);
> > +	if (blk_set_preempt_only(q)) {
> > +		blk_mq_unfreeze_queue(q);
> > +		return -EINVAL;
> > +	}
> 
> This way is wrong, if blk_set_preempt_only() returns true
> it means the queue has been in PREEMPT_ONLY already,
> and failing scsi_device_quiesce() can break suspend/resume or
> sending SCSI domain validation command.

That's an interesting comment but I propose that what you suggest is implemented
through a separate patch. The above code preserves the existing behavior, namely
that scsi_device_quiesce() returns an error code if called when a SCSI device has
already been quiesced. See also scsi_device_set_state(). BTW, I don't think that
it is likely that this will occur. The only code other than the power management
code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store
method and the SCSI parallel code. I don't know any software that sets the SCSI
QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a
significant number of users.

> > +	/*
> > +	 * Ensure that the effect of blk_set_preempt_only() will be visible
> > +	 * for percpu_ref_tryget() callers that occur after the queue
> > +	 * unfreeze. See also https://lwn.net/Articles/573497/.
> > +	 */
> > +	synchronize_rcu();
> 
> This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag
> before freezing queue since blk_mq_freeze_queue() may implicate
> one synchronize_rcu().

That's an interesting comment. I will look further into this.

> > @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev)
> >  	 */
> >  	mutex_lock(&sdev->state_mutex);
> >  	if (sdev->sdev_state == SDEV_QUIESCE &&
> > -	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
> > +	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0) {
> > +		blk_clear_preempt_only(sdev->request_queue);
> >  		scsi_run_queue(sdev->request_queue);
> > +	}
> >  	mutex_unlock(&sdev->state_mutex);
> 
> scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't
> to be run with holding sdev->state_mutex, just like in quiesce path.

I will look into removing the scsi_run_queue() call. But I prefer to keep
the blk_clear_preempt_only() inside the critical section because that call
doesn't sleep and because it changes state information that is related to
the SCSI state.

Bart.