On Sat, 2017-10-07 at 12:33 +0800, Ming Lei wrote: > On Wed, Oct 04, 2017 at 05:01:10PM -0700, Bart Van Assche wrote: > > It is essential during suspend and resume that neither the filesystem > > state nor the filesystem metadata in RAM changes. This is why while > > the hibernation image is being written or restored that SCSI devices > > quiesce isn't used only for suspend and resume, And the issue isn't > suspend/resume specific too. So please change the title/commit log > as sort of 'make SCSI quiesce more reliable/safe'. OK, I will modify the patch title. > > - if (percpu_ref_tryget_live(&q->q_usage_counter)) > > - return 0; > > + if (percpu_ref_tryget_live(&q->q_usage_counter)) { > > + /* > > + * The code that sets the PREEMPT_ONLY flag is > > + * responsible for ensuring that that flag is globally > > + * visible before the queue is unfrozen. > > + */ > > + if (preempt || !blk_queue_preempt_only(q)) { > > PREEMPT_ONLY flag is checked without RCU read lock held, so the > synchronize_rcu() may just wait for completion of pre-exit > percpu_ref_tryget_live(), which can be reordered with the > reading on blk_queue_preempt_only(). OK, I will add rcu_read_lock_sched/unlock_sched() calls around this code. > > int > > scsi_device_quiesce(struct scsi_device *sdev) > > { > > + struct request_queue *q = sdev->request_queue; > > int err; > > > > + blk_mq_freeze_queue(q); > > + if (blk_set_preempt_only(q)) { > > + blk_mq_unfreeze_queue(q); > > + return -EINVAL; > > + } > > This way is wrong, if blk_set_preempt_only() returns true > it means the queue has been in PREEMPT_ONLY already, > and failing scsi_device_quiesce() can break suspend/resume or > sending SCSI domain validation command. That's an interesting comment but I propose that what you suggest is implemented through a separate patch. The above code preserves the existing behavior, namely that scsi_device_quiesce() returns an error code if called when a SCSI device has already been quiesced. See also scsi_device_set_state(). BTW, I don't think that it is likely that this will occur. The only code other than the power management code I know of that sets the SCSI QUIESCE state is the SCSI sysfs state store method and the SCSI parallel code. I don't know any software that sets the SCSI QUIESCE state through sysfs. And I'm not sure the SCSI parallel code has a significant number of users. > > + /* > > + * Ensure that the effect of blk_set_preempt_only() will be visible > > + * for percpu_ref_tryget() callers that occur after the queue > > + * unfreeze. See also https://lwn.net/Articles/573497/. > > + */ > > + synchronize_rcu(); > > This synchronize_rcu may be saved if we set the PREEMPT_ONLY flag > before freezing queue since blk_mq_freeze_queue() may implicate > one synchronize_rcu(). That's an interesting comment. I will look further into this. > > @@ -2961,8 +2970,10 @@ void scsi_device_resume(struct scsi_device *sdev) > > */ > > mutex_lock(&sdev->state_mutex); > > if (sdev->sdev_state == SDEV_QUIESCE && > > - scsi_device_set_state(sdev, SDEV_RUNNING) == 0) > > + scsi_device_set_state(sdev, SDEV_RUNNING) == 0) { > > + blk_clear_preempt_only(sdev->request_queue); > > scsi_run_queue(sdev->request_queue); > > + } > > mutex_unlock(&sdev->state_mutex); > > scsi_run_queue() can be removed, and blk_clear_preempt_only() needn't > to be run with holding sdev->state_mutex, just like in quiesce path. I will look into removing the scsi_run_queue() call. But I prefer to keep the blk_clear_preempt_only() inside the critical section because that call doesn't sleep and because it changes state information that is related to the SCSI state. Bart.