Re: [PATCH] block: Fix bug in runtime-resume handling

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Sat, 29 Aug 2020 14:56:53 -0400

On Sat, Aug 29, 2020 at 06:33:26PM +0200, Martin Kepplinger wrote:
> On 29.08.20 17:26, Alan Stern wrote:
> > Hmmm.  I'm wondering about something you wrote back in June 
> > (https://marc.info/?l=linux-scsi&m=159345778431615&w=2):
> > 
> > 	blk_queue_enter() always - especially when sd is runtime 
> > 	suspended and I try to mount as above - sets success to be true 
> > 	for me, so never continues down to bkl_pm_request_resume(). All 
> > 	I see is "PM: Removing info for No Bus:sda1".
> > 
> > blk_queue_enter() would always set success to be true because pm 
> > (derived from the BLK_MQ_REQ_PREEMPT flag) is true.  But why was the 
> > BLK_MQ_REQ_PREEMPT flag set?  In other words, where was 
> > blk_queue_enter() called from?
> > 
> > Can you get a stack trace (i.e., call dump_stack()) at exactly this 
> > point, that is, when pm is true and q->rpm_status is RPM_SUSPENDED?  Or 
> > do you already know the answer?
> > 
> >
> 
> I reverted any scsi/block out-of-tree fixes for this.
> 
> when I try to mount, pm is TRUE (BLK_MQ_REQ_PREEMT set) and that's the
> first stack trace I get in this condition, inside of blk_queue_enter():
> 
> There is more, but I don't know if that's interesting.
> 
> [   38.642202] CPU: 2 PID: 1522 Comm: mount Not tainted 5.8.0-1-librem5 #487
> [   38.642207] Hardware name: Purism Librem 5r3 (DT)
> [   38.642213] Call trace:
> [   38.642233]  dump_backtrace+0x0/0x210
> [   38.642242]  show_stack+0x20/0x30
> [   38.642252]  dump_stack+0xc8/0x128
> [   38.642262]  blk_queue_enter+0x1b8/0x2d8
> [   38.642271]  blk_mq_alloc_request+0x54/0xb0
> [   38.642277]  blk_get_request+0x34/0x78
> [   38.642286]  __scsi_execute+0x60/0x1c8
> [   38.642291]  scsi_test_unit_ready+0x88/0x118
> [   38.642298]  sd_check_events+0x110/0x158
> [   38.642306]  disk_check_events+0x68/0x188
> [   38.642312]  disk_clear_events+0x84/0x198
> [   38.642320]  check_disk_change+0x38/0x90
> [   38.642325]  sd_open+0x60/0x148
> [   38.642330]  __blkdev_get+0xcc/0x4c8
> [   38.642335]  __blkdev_get+0x278/0x4c8
> [   38.642339]  blkdev_get+0x128/0x1a8
> [   38.642345]  blkdev_open+0x98/0xb0
> [   38.642354]  do_dentry_open+0x130/0x3c8
> [   38.642359]  vfs_open+0x34/0x40
> [   38.642366]  path_openat+0xa30/0xe40
> [   38.642372]  do_filp_open+0x84/0x100
> [   38.642377]  do_sys_openat2+0x1f4/0x2b0
> [   38.642382]  do_sys_open+0x60/0xa8
> (...)
> 
> and of course it doesn't work and /dev/sda1 disappears, see the initial
> discussion that led to your fix.

Great!  That's exactly what I was looking for, thank you.

Bart, this is a perfect example of the potential race I've been talking 
about in the other email thread.  Suppose thread 0 is carrying out a 
runtime suspend of a SCSI disk and at the same time, thread 1 is opening 
the disk's block device (as we see in the stack trace here).  Then we 
could have the following:

	Thread 0		Thread 1
	--------		--------
	Start runtime suspend
	blk_pre_runtime_suspend calls
	  blk_set_pm_only and sets
	  q->rpm_status to RPM_SUSPENDING

				Call sd_open -> ... -> scsi_test_unit_ready
				  -> __scsi_execute -> ...
				  -> blk_queue_enter
				Sees BLK_MQ_REQ_PREEMPT set and
				  RPM_SUSPENDING queue status, so does 
				  not postpone the request

	blk_post_runtime_suspend sets
	  q->rpm_status to RPM_SUSPENDED
	The drive goes into runtime suspend

				Issues the TEST UNIT READY request
				Request fails because the drive is suspended

One way to avoid this race is mutual exclusion: We could make sd_open 
prevent the drive from being runtime suspended until it returns.  
However I don't like this approach; it would mean tracking down every 
possible pathway to __scsi_execute and making sure that runtime suspend 
is blocked.

A more fine-grained approach would be to have __scsi_execute itself call 
scsi_autopm_get/put_device whenever the rq_flags argument doesn't 
contain RQF_PM.  This way we wouldn't have to worry about missing any 
possiible pathways.  But it relies on an implicit assumption that 
__scsi_execute is the only place where the PREEMPT flag gets set.

A third possibility is the approach I outlined before, adding a 
BLK_MQ_REQ_PM flag.  But to avoid the deadlock you pointed out, I would 
make blk_queue_enter smarter about whether to postpone a request.  The 
logic would go like this:

	If !blk_queue_pm_only:
		Allow
	If !BLK_MQ_REQ_PREEMPT:
		Postpone
	If q->rpm_status is RPM_ACTIVE:
		Allow
	If !BLK_MQ_REQ_PM:
		Postpone
	If q->rpm_status is RPM_SUSPENDED:
		Postpone
	Else:
		Allow

The assumption here is that the PREEMPT flag is set whenever the PM flag 
is.

I believe either the second or third possibility would work.  The second 
looks to be the simplest

What do you think?

Alan Stern