On 9/6/23 02:17, Rodrigo Vivi wrote:
>> I think I have now figured it out, and fixed it. I could reliably recreate the same
>> hang both with qemu using a failed suspend (using a device not supporting
>> suspend) and real hardware with a short rtc wake.
>>
>> It turns out that the root cause of the hang is ata_scsi_dev_rescan(), which is
>> scheduled asynchronously from PM context on resume. With a quick suspend after a
>> resume, suspend may win the race against that ata_scsi_dev_rescan() task
>> execution and we end up calling scsi_rescan_device() on a suspended device,
>> causing that function to wait with the device_lock() held, which causes PM to
>> deadlock when it needs to resume the scsi device. The recent commit 6aa0365a3c85
>> ("ata: libata-scsi: Avoid deadlock on rescan after device resume") was intended
>> to fix that, but it did so less than ideally and the fix has a race on the scsi
>> power state check, thus not always preventing the resume hang.
>>
>> I pushed a new patch series that goes on top of 6.5.0: resume-v3 branch in the
>> libata tree:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git
>>
>> This works very well for me. Using this script on real hardware:
>>
>> for (( i=0; i<20; i++ )); do
>> 	echo "+2" > /sys/class/rtc/rtc0/wakealarm
>> 	echo mem > /sys/power/state
>> done
>>
>> The system repeatedly suspends and resumes and comes back OK. Of note is that if
>> I set the delay to +1 second, then I sometimes do not see the system resume and
>> the script stops. But using wake-on-lan (wol command) from another machine to
>> wake it up, the machine resumes normally and continues executing the script. So
>> it seems that setting the rtc alarm unreasonably early results in it being lost
>> and the system suspending, waiting to be woken up.
>>
>> I also tested this in qemu. As mentioned before, I cannot get rtc alarm to wake
>> up the VM guest though.
>> However, using a virtio device that does not support
>> suspend, resume starts in the middle of the suspend operation due to the suspend
>> error reported by that device. And it turns out that systemd really insists on
>> suspending the system despite the error, so when running "systemctl suspend" I
>> see a retry for suspend right after the first failed one. That is enough to
>> trigger the issue without the patches.
>>
>> Please test!
>
> \o/ works for me!
>
> Feel free to use:
> Tested-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>

Awesome! Thank you for testing. I will rebase the patches and post the official
version for 6.6 fixes (and the other cleanup patches for 6.7), after retesting.
You never know :)

-- 
Damien Le Moal
Western Digital Research