RE: [PATCH] scsi: storvsc: Fix a panic in the hibernation procedure

Dexuan Cui <decui@xxxxxxxxxxxxx> · Thu, 23 Apr 2020 18:29:30 +0000

> From: Bart Van Assche <bvanassche@xxxxxxx>
> Sent: Thursday, April 23, 2020 9:38 AM
> 
> On 4/23/20 12:04 AM, Dexuan Cui wrote:
> > It looks the sd suspend callbacks are only for the I/O from the disk, e.g.
> > from the file system that lives in some partition of some disk.
> >
> > The panic I'm seeing is not from sd. I think it's from a kernel thread
> > that tries to detect the status of the SCSI CDROM. This is the snipped
> > messages (the full version is at ...): here
> > the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
> > and later the storvsc LLD's suspend callback is also called, but
> > sr_block_check_events() can still try to submit SCSI commands to storvsc:
> >
> > [   11.668741] sr 0:0:0:1: bus quiesce
> > [   11.668804] sd 0:0:0:0: bus quiesce
> > [   11.698082] scsi target0:0:0: bus quiesce
> > [   11.703296] scsi host0: bus quiesce
> > [   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq
> bus quiesce
> > [   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq
> bus quiesce
> > [   11.804042] BUG: kernel NULL pointer dereference, address:
> 0000000000000090
> > [   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
> > [   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714
> [hv_storvsc]
> > [   11.804996] Call Trace:
> > [   11.804996]  scsi_queue_rq+0x593/0xa10
> > [   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
> > [   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
> > [   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
> > [   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
> > [   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
> > [   11.804996]  blk_execute_rq+0x4b/0xa0
> > [   11.804996]  __scsi_execute+0xeb/0x250
> > [   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
> > [   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
> > [   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
> > [   11.804996]  disk_check_events+0x68/0x160
> > [   11.804996]  process_one_work+0x20c/0x3d0
> > [   11.804996]  worker_thread+0x2d/0x3e0
> > [   11.804996]  kthread+0x10c/0x130
> > [   11.804996]  ret_from_fork+0x35/0x40
> >
> > It looks the issue is: scsi_bus_freeze() -> ... -> scsi_dev_type_suspend ->
> > scsi_device_quiesce() does not guarantee the device is totally quiescent:
> 
> During hibernation processes are frozen before devices are quiesced.
> freeze_processes() calls try_to_freeze_tasks() and that function in turn
> calls freeze_workqueues_begin() and freeze_workqueues_busy().
> freeze_workqueues_busy() freezes all freezable workqueues including
> system_freezable_power_efficient_wq, the workqueue from which
> check_events functions are called. Some time after freezable workqueues
> are frozen dpm_suspend(PMSG_FREEZE) is called. That last call triggers
> the pm_ops.freeze callbacks, including the pm_ops.freeze callbacks
> defined in the SCSI core.
> 
> The above trace seems to indicate that freezing workqueues has not
> happened before devices were frozen. How about doing the following to
> retrieve more information about what is going on?
> * Enable CONFIG_PM_DEBUG in the kernel configuration.
> * Run echo 1 > /sys/power/pm_print_times and echo 1 >
> /sys/power/pm_debug_messages before hibernation starts.
> 
> Bart.

Good point! My panic happens in the "restore" path, not the "save" path.

In the "save" path, as you described, it looks everything is done correctly.
BTW, freeze_processes() only freezes the userspace processes. After
hibernate() -> freeze_processes(), hibernate() -> hibernation_snapshot()
-> freeze_kernel_threads() also freezes the "freezable" kernel processes,
and then we call dpm_suspend(PMSG_FREEZE), and next we do a lot of
other things, and finally the system is powered off.

In the "restore" path of the hibernation:
1. We start the system and a fresh new kernel starts to run.

2. The 'new' kernel itself finishes the initialization and the 'init' script
in the initramfs starts to run. The 'init' script notices there is a saved
state of the previous 'old' kernel in some swap file/partition, so it won't
do a usual start-up; instead, the 'init' script runs a program called 'resume'.

3. The 'resume' program talks to the kernel via /sys/power/resume and
asks the kernel to do a resume-from-disk.

4. The kernel starts to run resume_store() -> software_resume ->
freeze_processes(), which freezes the userspace, but the "freezable"
kernel threads are not frozen!!!

So it looks the below patch also works for me:

--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -898,6 +898,11 @@ static int software_resume(void)
        error = freeze_processes();
        if (error)
                goto Close_Finish;
+
+       error = freeze_kernel_threads();
+       if (error)
+               goto Close_Finish;
+
        error = load_image_and_restore();
        thaw_processes();
  Finish:

Just to be sure, I'll do more tests, but I believe the panic can be fixed
by this according to my tests I have done so far.

I'm still not sure what the comment before scsi_device_quiesce() means:
 *  ... Since special requests may also be requeued requests,
 *      a successful return doesn't guarantee the device will be
 *      totally quiescent.

I don't know if there can be some other I/O submitted after
scsi_device_quiesce() returns in the case of hibernation, and I don't
know if aac_suspend() -> scsi_host_block() should be fixed/removed,
but as far as the panic is concerned, I'm very glad I have found a better
fix with your help. 

Thank you so much, Bart! 

Thanks,
-- Dexuan