> From: Bart Van Assche <bvanassche@xxxxxxx> > Sent: Wednesday, April 22, 2020 12:56 PM > On 4/21/20 11:24 PM, Dexuan Cui wrote: > > Upon suspend, I suppose the other LLDs can not accept I/O either, then > > what do they do when some kernel thread still tries to submit I/O? Do > > they "block" the I/O until resume, or just return an error immediately? > > This is my understanding of how other SCSI LLDs handle suspend/resume: > - The ULP driver, e.g. the SCSI sd driver, implements power management > support by providing callbacks in struct scsi_driver.gendrv.pm and also > in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device > cache and send a STOP command to the device. It looks the sd suspend callbacks are only for the I/O from the disk, e.g. from the file system that lives in some partition of some disk. The panic I'm seeing is not from sd. I think it's from a kernel thread that tries to detect the status of the SCSI CDROM. This is the snipped messages (the full version is at https://lkml.org/lkml/2020/4/10/47): here the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called, and later the storvsc LLD's suspend callback is also called, but sr_block_check_events() can still try to submit SCSI commands to storvsc: [ 11.668741] sr 0:0:0:1: bus quiesce [ 11.668804] sd 0:0:0:0: bus quiesce [ 11.698082] scsi target0:0:0: bus quiesce [ 11.703296] scsi host0: bus quiesce [ 11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce [ 11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce [ 11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090 [ 11.804996] Workqueue: events_freezable_power_ disk_events_workfn [ 11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc] [ 11.804996] Call Trace: [ 11.804996] scsi_queue_rq+0x593/0xa10 [ 11.804996] blk_mq_dispatch_rq_list+0x8d/0x510 [ 11.804996] blk_mq_sched_dispatch_requests+0xed/0x170 [ 11.804996] __blk_mq_run_hw_queue+0x55/0x110 [ 11.804996] __blk_mq_delay_run_hw_queue+0x141/0x160 [ 11.804996] blk_mq_sched_insert_request+0xc3/0x170 [ 11.804996] blk_execute_rq+0x4b/0xa0 [ 11.804996] __scsi_execute+0xeb/0x250 [ 11.804996] sr_check_events+0x9f/0x270 [sr_mod] [ 11.804996] cdrom_check_events+0x1a/0x30 [cdrom] [ 11.804996] sr_block_check_events+0xcc/0x110 [sr_mod] [ 11.804996] disk_check_events+0x68/0x160 [ 11.804996] process_one_work+0x20c/0x3d0 [ 11.804996] worker_thread+0x2d/0x3e0 [ 11.804996] kthread+0x10c/0x130 [ 11.804996] ret_from_fork+0x35/0x40 It looks the issue is: scsi_bus_freeze() -> ... -> scsi_dev_type_suspend -> scsi_device_quiesce() does not guarantee the device is totally quiescent: /** * scsi_device_quiesce - Block user issued commands. * @sdev: scsi device to quiesce. * * This works by trying to transition to the SDEV_QUIESCE state * (which must be a legal transition). When the device is in this * state, only special requests will be accepted, all others will * be deferred. Since special requests may also be requeued requests, * a successful return doesn't guarantee the device will be * totally quiescent. I guess the "special requests" may include the "detect the status of the SCSI CDROM" request from sr_block_check_events(). It looks scsi_device_quiesce() does not freeze the queue and it just puts the device to the SDEV_QUIESCE state, which is not enough to prevent sr_block_check_events from submitting SCSI commands. It looks scsi_host_block() freezes the queue and flushes all the pending I/O, so it can fix the aforementioned NULL pointer deference panic for me. > - SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and > resume callbacks. These callbacks can be used to make the PCIe device > enter/leave a power saving state. No new SCSI commands should be > submitted after pci_driver.suspend has been called. > > > I had a look at drivers/scsi/xen-scsifront.c. It looks this LLD implements > > a mechanism of marking the device busy upon suspend, so any I/O will > > immediately get an error SCSI_MLQUEUE_HOST_BUSY. IMO the > > disadvantage is: the mechanism of marking the device busy looks > > complex, and the hot path .queuecommand() has to take the > > spinlock shost->host_lock, which should affect the performance. > > I think the code you are referring to is the code in > scsifront_suspend(). A pointer to that function is stored in a struct > xenbus_driver instance. That's another structure than the structures > mentioned above. > > Wouldn't it be better to make sure that any hypervisor suspend > operations happen after it is guaranteed that no further SCSI commands > will be submitted such that hypervisor suspend operations do not have to > deal with SCSI commands submitted during or after the hypervisor suspend > callback? I agree with you, but it looks scsi_bus_freeze() doesn't guarantee no further SCSI commands will be submitted, as I described above. Maybe that is why scsifront_suspend() has to invent a mechanism to cope with the issue. > > It looks drivers/scsi/nsp32.c: nsp32_suspend() and > > drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O > > after suspend. I doubt this is correct. > > nsp32_suspend() is a PCI suspend callback. If any SCSI commands would be > submitted after that callback has started that would mean that the SCSI > suspend and PCIe suspend operations are called in the wrong order. I do > not agree that code for suspending SCSI commands should be added in > nsp32_suspend(). > > > So it looks to me there is no simple mechanism to handle the scenario > > here, and I guess that's why the scsi_host_block/unblock APIs are > > introduced, and actually there is already an user of the APIs: > > 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O"). > > > > The aacraid patch says: "This has the advantage that the block layer will > > stop sending I/O to the adapter instead of having the SCSI midlayer > > requeueing I/O internally". It looks this may imply that using the new > > APIs is encouraged? > > I'm fine with using these new functions in device reset handlers. Using > these functions in power management handlers seems wrong to me. It looks you're suggesting the scsi_host_block() in aac_suspend() should be removed? :-) > > PS, here storvsc has to destroy and re-construct the I/O queues: the > > I/O queues are shared memory ringbufers between the guest and the > > host; in the resume path of the hibernation procedure, the memory > > pages allocated by the 'new' kernel is different from that allocated by > > the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel > > must destroy the mapping of the pages, and later after we jump to > > the 'old' kernel, we'll re-create the mapping using the pages allocated > > by the 'old' kernel. Here "create the mapping" means the guest tells > > the host about the physical addresses of the pages. > > Thank you for having clarified this. This helps me to understand the HV > driver framework better. I think this means that the hv_driver.suspend > function should be called at a later time than SCSI suspend. From I agree, but it looks here scsi_bus_freeze() doesn't work as we expected? > Documentation/driver-api/device_link.rst: "By default, the driver core > only enforces dependencies between devices that are borne out of a > parent/child relationship within the device hierarchy: When suspending, > resuming or shutting down the system, devices are ordered based on this > relationship, i.e. children are always suspended before their parent, > and the parent is always resumed before its children." Is there a single > storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has Yes. > it been considered to make storvsc_drv the parent device of all SCSI > devices created by the storvsc driver? > > Thanks, > > Bart. Yes, I think so: root@localhost:~# ls -rtl /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd Here the driver of /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943 is storvsc, which creates host3/target3:0:0/3:0:0:0. So it looks there is no ordering issue. Thanks, -- Dexuan