Re: [PATCH] ata,scsi: do not issue START STOP UNIT on resume

Damien Le Moal <dlemoal@xxxxxxxxxx> · Wed, 6 Sep 2023 10:07:07 +0900

On 9/6/23 02:17, Rodrigo Vivi wrote:
>> I think I have now figured it out, and fixed. I could reliably recreate the same
>> hang both with qemu using a failed suspend (using a device not supporting
>> suspend) and real hardware with a short rtc wake.
>>
>> It turns out that the root cause of the hang is ata_scsi_dev_rescan(), which is
>> scheduled asynchronously from PM context on resume. With quick suspend after a
>> resume, suspend may win the race against that ata_scsi_dev_rescan() task
>> execution and we end up calling scsi_rescan_device() on a suspended device,
>> causing that function to wait with the device_lock() held, which causes PM to
>> deadlock when it needs to resume the scsi device. The recent commit 6aa0365a3c85
>> ("ata: libata-scsi: Avoid deadlock on rescan after device resume") was intended
>> to fix that, but it did so less than ideally and the fix has a race on the scsi
>> power state check, thus not always preventing the resume hang.
>>
>> I pushed a new patch series that goes on top of 6.5.0: resume-v3 branch in the
>> libata tree:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git
>>
>> This works very well for me. Using this script on real hardware:
>>
>> for (( i=0; i<20; i++ )); do
>> 	echo "+2" > /sys/class/rtc/rtc0/wakealarm
>> 	echo mem > /sys/power/state
>> done
>>
>> The system repeatedly suspends and resumes and comes back OK. Of note is that if
>> I set the delay to +1 second, then I sometimes do not see the system resume and
>> the script stops. But using wakeup-on-lan (wol command) from another machine to
>> wake it up, the machine resumes normally and continues executing the script. So
>> it seems that setting the rtc alarm unreasonably early results in it being lost
>> and the system staying suspended, waiting to be woken up.
>>
>> I also tested this in qemu. As mentioned before, I cannot get rtc alarm to wake
>> up the VM guest though. However, using a virtio device that does not support
>> suspend, resume starts in the middle of the suspend operation due to the suspend
>> error reported by that device. And it turns out that systemd really insists on
>> suspending the system despite the error, so when running "systemctl suspend" I
>> see a retry for suspend right after the first failed one. That is enough to
>> trigger the issue without the patches.
>>
>> Please test !
> 
> \o/ works for me!
> 
> Feel free to use:
> Tested-by: Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx>

Awesome! Thank you for testing. I will rebase the patches and post the official
version for 6.6 fixes (and the other cleanup patches for 6.7) after retesting.
You never know :)
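
For anyone who wants to repeat the test, here is a rough sketch of the same
suspend/resume loop wrapped in a function. The function name, the RTC_ALARM and
PM_STATE variables, the alarm reset and the early return on error are all
assumptions added for illustration; only the two sysfs paths and the "+N" and
"mem" writes come from the script quoted above.

```shell
# Sketch only: same loop as the quoted script, with hypothetical names.
# The defaults are the sysfs paths used in the thread; both can be overridden.
RTC_ALARM="${RTC_ALARM:-/sys/class/rtc/rtc0/wakealarm}"
PM_STATE="${PM_STATE:-/sys/power/state}"

# suspend_cycles COUNT DELAY
# Arm an rtc wake alarm DELAY seconds ahead, suspend to RAM, repeat COUNT times.
suspend_cycles() {
	local count="$1" delay="$2" i
	for (( i = 0; i < count; i++ )); do
		# Assumption: clear any alarm left over from a previous cycle first.
		echo 0 > "$RTC_ALARM" || return 1
		echo "+${delay}" > "$RTC_ALARM" || return 1
		# This write returns once the system is running again, or fails
		# if the suspend attempt is aborted.
		echo mem > "$PM_STATE" || return 1
		echo "cycle $((i + 1))/${count} done"
	done
}
```

Running "suspend_cycles 20 2" as root should be equivalent to the loop above.
Stopping at the first failed write is a design choice of this sketch, so that
an aborted suspend is noticed instead of being retried silently.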

-- 
Damien Le Moal
Western Digital Research