On Wed, 4 Oct 2023 15:18:34 +0900
Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:

> On 10/4/23 15:13, Hannes Reinecke wrote:
> > On 10/3/23 17:18, Petr Tesařík wrote:
> >> On Tue, 3 Oct 2023 14:57:46 +0200
> >> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:
> >>
> >>> On Tue, Oct 3, 2023 at 2:51 PM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> >>>>
> >>>> On Tue, 3 Oct 2023 14:48:13 +0200
> >>>> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:
> >>>>
> >>>>> On Tue, Oct 3, 2023 at 2:40 PM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On Tue, 3 Oct 2023 14:34:56 +0200
> >>>>>> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>> On Tue, Oct 3, 2023 at 1:02 PM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, 3 Oct 2023 12:15:10 +0200
> >>>>>>>> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>> On Tue, Oct 3, 2023 at 11:31 AM Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi again (adding more recipients),
> >>>>>>>>>>
> >>>>>>>>>> On Sat, 30 Sep 2023 12:20:54 +0200
> >>>>>>>>>> Petr Tesařík <petr@xxxxxxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> this time no patch (yet). In short, my Thinkpad running v6.6-rc3 fails
> >>>>>>>>>>> to resume from S3. It also fails the same way with Tumbleweed v6.5
> >>>>>>>>>>> kernel. I was able to capture a crash dump of the v6.5 kernel, and
> >>>>>>>>>>> here's my analysis:
> >>>>>>>>>>>
> >>>>>>>>>>> The system never gets to waking up my SATA SSD disk:
> >>>>>>>>>>>
> >>>>>>>>>>> [0:0:0:0] disk ATA KINGSTON SEDC600 H5.1 /dev/sda
> >>>>>>>>>>>
> >>>>>>>>>>> There is a pending resume work for kworker/u32:12 (PID 11032), but this
> >>>>>>>>>>> worker is stuck in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11032)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 acpi_device_hotplug (../drivers/acpi/scan.c:382:2)
> >>>>>>>>>>> #7 acpi_hotplug_work_fn (../drivers/acpi/osl.c:1162:2)
> >>>>>>>>>>> #8 process_one_work (../kernel/workqueue.c:2600:2)
> >>>>>>>>>>> #9 worker_thread (../kernel/workqueue.c:2751:4)
> >>>>>>>>>>> #10 kthread (../kernel/kthread.c:389:9)
> >>>>>>>>>>> #11 ret_from_fork (../arch/x86/kernel/process.c:145:3)
> >>>>>>>>>>> #12 ret_from_fork_asm+0x1b/0x20 (../arch/x86/entry/entry_64.S:304)
> >>>>>>>>>>>
> >>>>>>>>>>> acpi_device_hotplug() tries to acquire acpi_scan_lock, which is held by
> >>>>>>>>>>> systemd-sleep (PID 11002). This task is also in 'D' state:
> >>>>>>>>>>>
> >>>>>>>>>>>>>> prog.stack_trace(11002)
> >>>>>>>>>>> #0 context_switch (../kernel/sched/core.c:5381:2)
> >>>>>>>>>>> #1 __schedule (../kernel/sched/core.c:6710:8)
> >>>>>>>>>>> #2 schedule (../kernel/sched/core.c:6786:3)
> >>>>>>>>>>> #3 schedule_preempt_disabled (../kernel/sched/core.c:6845:2)
> >>>>>>>>>>> #4 __mutex_lock_common (../kernel/locking/mutex.c:679:3)
> >>>>>>>>>>> #5 __mutex_lock (../kernel/locking/mutex.c:747:9)
> >>>>>>>>>>> #6 device_lock (../include/linux/device.h:958:2)
> >>>>>>>>>>> #7 device_complete (../drivers/base/power/main.c:1063:2)
> >>>>>>>>>>> #8 dpm_complete (../drivers/base/power/main.c:1121:3)
> >>>>>>>>>>> #9 suspend_devices_and_enter (../kernel/power/suspend.c:516:2)
> >>>>>>>>>>
> >>>>>>>>>> I believe the issue must be somewhere here. The whole suspend and
> >>>>>>>>>> resume logic in suspend_devices_and_enter() is framed by
> >>>>>>>>>> platform_suspend_begin() and platform_resume_end().
> >>>>>>>>>>
> >>>>>>>>>> My system is an ACPI system, so suspend_ops contains:
> >>>>>>>>>>
> >>>>>>>>>>   .begin = acpi_suspend_begin,
> >>>>>>>>>>   .end = acpi_pm_end,
> >>>>>>>>>>
> >>>>>>>>>> Now, acpi_suspend_begin() acquires acpi_scan_lock through
> >>>>>>>>>> acpi_pm_start(), and the lock is not released until acpi_pm_end().
> >>>>>>>>>> Since dpm_complete() waits for the completion of a work that tries to
> >>>>>>>>>> acquire acpi_scan_lock, the system will deadlock.
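To make the pattern concrete, here is a minimal sketch of the scenario I
described above (illustration only, not the actual ACPI/PM code; scan_lock,
hotplug_work and the two functions below are made-up stand-ins):

  #include <linux/mutex.h>
  #include <linux/workqueue.h>

  static DEFINE_MUTEX(scan_lock);               /* stand-in for acpi_scan_lock */

  static void hotplug_fn(struct work_struct *work)
  {
          mutex_lock(&scan_lock);               /* sleeps: the suspend path holds it */
          /* ... handle the hotplug event ... */
          mutex_unlock(&scan_lock);
  }
  static DECLARE_WORK(hotplug_work, hotplug_fn);

  static void suspend_resume_path(void)
  {
          mutex_lock(&scan_lock);               /* .begin(): acpi_pm_start() */
          /* ... suspend, sleep, wake up ... */
          queue_work(system_wq, &hotplug_work); /* a hotplug event comes in */
          flush_work(&hotplug_work);            /* the wait done by dpm_complete():
                                                 * the worker needs scan_lock, so
                                                 * this never returns */
          mutex_unlock(&scan_lock);             /* .end(): acpi_pm_end(), unreached */
  }

In the real trace the wait is indirect (dpm_complete() blocks in device_lock()
rather than in an explicit flush), but the shape is the same.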
> >>>>>>>>>
> >>>>>>>>> So holding acpi_scan_lock across suspend-resume is basically to
> >>>>>>>>> prevent the hotplug from taking place then IIRC.
> >>>>>>>>>
> >>>>>>>>>> AFAICS either:
> >>>>>>>>>>
> >>>>>>>>>> a. the ACPI lock cannot be held while dpm_complete() runs, or
> >>>>>>>>>> b. ata_scsi_dev_rescan() must not be scheduled before the system is
> >>>>>>>>>>    resumed, or
> >>>>>>>>>> c. acpi_device_hotplug() must be implemented without taking dev->mutex.
> >>>>>>>>>>
> >>>>>>>>>> My gut feeling is that b. is the right answer.
> >>>>>>>>>
> >>>>>>>>> It's been a while since I last looked at that code, but then it
> >>>>>>>>> has not changed for quite some time either.
> >>>>>>>>>
> >>>>>>>>> It looks like the acpi_device_hotplug() path attempts to acquire
> >>>>>>>>> acpi_scan_lock() while holding dev->mutex which is kind of silly. I
> >>>>>>>>> need to check that, though.
> >>>>>>>>
> >>>>>>>> Thanks for your willingness. Well, it's not quite what you describe. If
> >>>>>>>> it was a simple ABBA deadlock, then it would be reported by lockdep.
> >>>>>>>> No, it's more complicated:
> >>>>>>>>
> >>>>>>>> 1. suspend_devices_and_enter() holds acpi_scan_lock,
> >>>>>>>> 2. an ACPI hotplug work runs, but acpi_device_hotplug() goes to sleep
> >>>>>>>>    when it gets to acquiring acpi_scan_lock,
> >>>>>>>> 3. ata_scsi_dev_rescan() submits a SCSI command and waits for its
> >>>>>>>>    completion while holding dev->mutex,
> >>>>>>>> 4. the SCSI completion work happens to be put on the same workqueue as
> >>>>>>>>    the ACPI hotplug work in step 2,
> >>>>>>>>    ^^^--- THIS is how the two events are serialized!
> >>>>>>>
> >>>>>>> Which is unexpected.
> >>>>>>>
> >>>>>>> And quite honestly I'm not sure how this can happen, because
> >>>>>>> acpi_hotplug_schedule() uses a dedicated workqueue and it is called
> >>>>>>> from (a) the "eject" sysfs attribute (which cannot happen while system
> >>>>>>> suspend-resume is in progress) and (b) acpi_bus_notify() which has
> >>>>>>> nothing to do with SCSI.
> >>>>>>
> >>>>>> Oh, you're right, and I was too quick. They cannot be on the same
> >>>>>> queue...
> >>>>>>
> >>>>>>> Maybe the workqueue used for the SCSI completion is freezable?
> >>>>>>
> >>>>>> Yes, that's it:
> >>>>>>
> >>>>>>   *(struct workqueue_struct *)0xffff97d240b2fe00 = {
> >>>>>>           /* ... */
> >>>>>>           .flags = (unsigned int)4,
> >>>>>>           /* WQ_FREEZABLE = 1 << 2 */
> >>>>>>
> >>>>>> Good. But if this workqueue is frozen, the system still cannot make
> >>>>>> progress.
> >>>>>
> >>>>> The problem seems to be that dev->mutex is held while the work item
> >>>>> goes to a freezable workqueue and is waited for, which is an almost
> >>>>> guaranteed deadlock scenario.
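In other words, the generic shape of the problem (again a minimal sketch,
purely illustrative; freezable_wq, resume_work and rescan() below are made-up
stand-ins, not the real SCSI or PM objects):

  #include <linux/device.h>
  #include <linux/workqueue.h>

  /* Assume this was created with alloc_workqueue("...", WQ_FREEZABLE, 0)
   * and has been frozen for the suspend transition.
   */
  static struct workqueue_struct *freezable_wq;

  static void resume_fn(struct work_struct *work)
  {
          /* whatever brings the queue back up; cannot run while frozen */
  }
  static DECLARE_WORK(resume_work, resume_fn);

  static void rescan(struct device *dev)
  {
          device_lock(dev);                        /* takes dev->mutex */
          queue_work(freezable_wq, &resume_work);  /* lands on the frozen wq */
          flush_work(&resume_work);                /* sleeps until the wq is thawed,
                                                    * but thawing happens only after
                                                    * dpm_complete(), which itself
                                                    * needs dev->mutex: deadlock */
          device_unlock(dev);
  }

The real code obviously does not call flush_work(); the wait is buried in the
SCSI command submission path, but the effect is the same.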
> >>>>
> >>>> Ah. Thanks for the explanation and direction! I'm going to dive into the
> >>>> block layer and/or SCSI code and bug other people with my findings.
> >>>
> >>> Please feel free to CC me on that in case I can help.
> >>
> >> And here I am again... The frozen workqueue is in fact pm_wq, and the
> >> work item that is waited for is pm_runtime_work. The block layer calls
> >> pm_request_resume() on the device to resume the queue.
> >>
> >> I bet the queue should not be resumed this early. In fact, it seems
> >> that this is somewhat known to the ATA developers, because
> >> ata_scsi_dev_rescan() contains this beautiful comment and code:
> >>
> >>         /*
> >>          * If the rescan work was scheduled because of a resume
> >>          * event, the port is already fully resumed, but the
> >>          * SCSI device may not yet be fully resumed. In such
> >>          * case, executing scsi_rescan_device() may cause a
> >>          * deadlock with the PM code on device_lock(). Prevent
> >>          * this by giving up and retrying rescan after a short
> >>          * delay.
> >>          */
> >>         delay_rescan = sdev->sdev_gendev.power.is_suspended;
> >>         if (delay_rescan) {
> >>                 scsi_device_put(sdev);
> >>                 break;
> >>         }
> >>
> >> It just doesn't seem to work as expected, at least not in my case.
> >
> > Did you test the libata suspend/resume patches from Damien?
>
> I think so, since the hang was repeated on 6.6-rc4, as mentioned in another email
> in this thread.

Indeed. I originally hit the hang on 6.5. I captured a coredump and did most
of my analysis on that version (hence the quoted code). But I tested again
with 6.6-rc4 just before sending my email, and the hang is still there.

The rescan work runs after dpm_resume(), but before dpm_complete(), and I
believe that's yet another issue.

Petr T
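For reference, the ordering that creates this window, as I read
kernel/power/suspend.c and drivers/base/power/main.c (heavily condensed, not
literal code, and the annotations are mine, so treat them as my reading only):

  enter_state(state)
    suspend_prepare()               freezes tasks and freezable workqueues
    suspend_devices_and_enter(state)
      platform_suspend_begin()      acpi_pm_start(): acpi_scan_lock acquired
      dpm_suspend_start()           power.is_suspended set for each device
      suspend_enter()               the actual S3 cycle
      dpm_resume_end()
        dpm_resume()                device_resume() clears power.is_suspended
          <-- the rescan work can run here: is_suspended is already false,
              so delay_rescan stays false, yet dpm_complete() is still
              ahead of us and pm_wq is still frozen -->
        dpm_complete()              device_complete() takes device_lock()
      platform_resume_end()         acpi_pm_end(): acpi_scan_lock released
    suspend_finish()
      thaw_processes()              workqueues (including pm_wq) thawed only here

If that reading is correct, it would also explain why the delay_rescan check
quoted above never fires here: by the time the rescan runs, is_suspended has
already been cleared.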