Hi again (adding more recipients), On Sat, 30 Sep 2023 12:20:54 +0200 Petr Tesařík <petr@xxxxxxxxxxx> wrote: > Hi all, > > this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails > to resume from S3. It also fails the same way with Tumbleweed v6.5 > kernel. I was able to capture a crash dump of the v6.5 kernel, and > here's my analysis: > > The system never gets to waking up my SATA SSD disk: > > [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda > > There is a pending resume work for kworker/u32:12 (PID 11032), but this > worker is stuck in 'D' state: > > >>> prog.stack_trace(11032) > #0 context_switch (../kernel/sched/core.c:5381:2) > #1 __schedule (../kernel/sched/core.c:6710:8) > #2 schedule (../kernel/sched/core.c:6786:3) > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2) > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3) > #5 __mutex_lock (../kernel/locking/mutex.c:747:9) > #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2) > #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2) > #8 process_one_work (../kernel/workqueue.c:2600:2) > #9 worker_thread (../kernel/workqueue.c:2751:4) > #10 kthread (../kernel/kthread.c:389:9) > #11 ret_from_fork (../arch/x86/kernel/process.c:145:3) > #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304) > > acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by > systemd-sleep (PID 11002). This task is also in 'D' state: > > >>> prog.stack_trace(11002) > #0 context_switch (../kernel/sched/core.c:5381:2) > #1 __schedule (../kernel/sched/core.c:6710:8) > #2 schedule (../kernel/sched/core.c:6786:3) > #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2) > #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3) > #5 __mutex_lock (../kernel/locking/mutex.c:747:9) > #6 device_lock (../include/linux/device.h:958:2) > #7 device_complete (../drivers/base/power/main.c:1063:2) > #8 dpm_complete (../drivers/base/power/main.c:1121:3) > #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2) I believe the issue must be somewhere here. The whole suspend and resume logic in suspend_devices_and_enter() is framed by platform_suspend_begin() and platform_resume_end(). My system is an ACPI system, so suspend_ops contains: .begin = acpi_suspend_begin, .end = acpi_pm_end, Now, acpi_suspend_begin() acquires acpi_scan_lock through acpi_pm_start(), and the lock is not released until acpi_pm_end(). Since dpm_complete() waits for the completion of a work that tries to acquire acpi_scan_lock, the system will deadlock. AFAICS either: a. the ACPI lock cannot be held while dpm_complete() runs, or b. ata_scsi_dev_rescan() must not be scheduled before the system is resumed, or c. acpi_device_hotplug() must be implemented without taking dev->mutex. My gut feeling is that b. is the right answer. I'm afraid I can't contribute too much more ATM, at least not before I find out what broke my USB hub between rc3 and rc4. Regards Petr T > #10 enter_state (../kernel/power/suspend.c:592:10) > #11 pm_suspend (../kernel/power/suspend.c:619:10) > #12 state_store (../kernel/power/main.c:707:11) > #13 kernfs_fop_write_iter (../fs/kernfs/file.c:334:9) > #14 call_write_iter (../include/linux/fs.h:1877:9) > #15 new_sync_write (../fs/read_write.c:491:8) > #16 vfs_write (../fs/read_write.c:584:9) > #17 ksys_write (../fs/read_write.c:637:9) > #18 do_syscall_x64 (../arch/x86/entry/common.c:50:14) > #19 do_syscall_64 (../arch/x86/entry/common.c:80:7) > #20 entry_SYSCALL_64+0xaa/0x1a6 (../arch/x86/entry/entry_64.S:120) > > It is trying to acquire dev->mutex, which is in turn held by > kworker/0:0 (PID 10830), also in 'D' state: > > >>> prog.stack_trace(10830) > #0 context_switch (../kernel/sched/core.c:5381:2) > #1 __schedule (../kernel/sched/core.c:6710:8) > #2 schedule (../kernel/sched/core.c:6786:3) > #3 blk_queue_enter (../block/blk-core.c:326:3) > #4 blk_mq_alloc_request (../block/blk-mq.c:592:9) > #5 scsi_alloc_request (../drivers/scsi/scsi_lib.c:1139:7) > #6 scsi_execute_cmd (../drivers/scsi/scsi_lib.c:217:8) > #7 scsi_vpd_inquiry (../drivers/scsi/scsi.c:312:11) > #8 scsi_get_vpd_size (../drivers/scsi/scsi.c:345:11) > #9 scsi_get_vpd_size (../drivers/scsi/scsi.c:336:5) > #10 scsi_get_vpd_buf (../drivers/scsi/scsi.c:415:12) > #11 scsi_attach_vpd (../drivers/scsi/scsi.c:483:12) > #12 scsi_rescan_device (../drivers/scsi/scsi_scan.c:1628:2) > #13 ata_scsi_dev_rescan (../drivers/ata/libata-scsi.c:4894:4) > #14 process_one_work (../kernel/workqueue.c:2600:2) > #15 worker_thread (../kernel/workqueue.c:2751:4) > #16 kthread (../kernel/kthread.c:389:9) > #17 ret_from_fork (../arch/x86/kernel/process.c:145:3) > #18 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304) > > And here we are, waiting for the completion of the resume work that is > scheduled on the blocked kworker/u32:12 (PID 11032), see above. > > FWIW the deadlock does not happen if autosuspend is turned off for this > disk. > > I'm willing to invest more time into debugging, but I'm not well-versed > in power management, so I may need some help from you as to what else I > should look at. Please, include me in Cc of all potential replies; I'm > not subscribed to linux-pm. > > Petr T