On 8/26/23 02:09, Rodrigo Vivi wrote: >>> So, maybe we have some kind of disks/configuration out there where this >>> start upon resume is needed? Maybe it is just a matter of timming to >>> ensure some firmware underneath is up and back to life? >> >> I do not think so. Suspend will issue a start stop unit command to put the drive >> to sleep and resume will reset the port (which should wake up the drive) and >> then issue an IDENTIFY command (which will also wake up the drive) and other >> read logs etc to rescan the drive. >> In both cases, if the commands do not complete, we would see errors/timeout and >> likely port reset/drive gone events. So I think this is likely another subtle >> race between scsi suspend and ata suspend that is causing a deadlock. >> >> The main issue I think is that there is no direct ancestry between the ata port >> (device) and scsi device, so the change to scsi async pm ops made a mess of the >> suspend/resume operations ordering. For suspend, scsi device (child of ata port) >> should be first, then ata port device (parent). For resume, the reverse order is >> needed. PM normally ensures that parent/child ordering, but we lack that >> parent/child relationship. I am working on fixing that but it is very slow >> progress because I have been so far enable to recreate any of the issues that >> have been reported. I am patching "blind"... > > I believe your suspicious makes sense. And on these lines, that patch you > attached earlier would fix that. However my initial tries of that didn't > help. I'm going to run more tests and get back to you. Rodrigo, I pushed the resume-v2 branch to libata tree: git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/dlemoal/libata (or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git) This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and other device shutdown issues. The first 4 patches are the main ones to fix suspend resume. I tested that on 2 different machines with different drives and with qemu. All seems fine. Could you try to run this through your CI ? I am very interested in seeing if it survives your suspend/resume tests. If you can confirm that all issues are fixed, I will rebase this on for-next and post. Thanks ! -- Damien Le Moal Western Digital Research