On Tue, Aug 29, 2023 at 03:17:38PM +0900, Damien Le Moal wrote: > On 8/26/23 02:09, Rodrigo Vivi wrote: > >>> So, maybe we have some kind of disks/configuration out there where this > >>> start upon resume is needed? Maybe it is just a matter of timming to > >>> ensure some firmware underneath is up and back to life? > >> > >> I do not think so. Suspend will issue a start stop unit command to put the drive > >> to sleep and resume will reset the port (which should wake up the drive) and > >> then issue an IDENTIFY command (which will also wake up the drive) and other > >> read logs etc to rescan the drive. > >> In both cases, if the commands do not complete, we would see errors/timeout and > >> likely port reset/drive gone events. So I think this is likely another subtle > >> race between scsi suspend and ata suspend that is causing a deadlock. > >> > >> The main issue I think is that there is no direct ancestry between the ata port > >> (device) and scsi device, so the change to scsi async pm ops made a mess of the > >> suspend/resume operations ordering. For suspend, scsi device (child of ata port) > >> should be first, then ata port device (parent). For resume, the reverse order is > >> needed. PM normally ensures that parent/child ordering, but we lack that > >> parent/child relationship. I am working on fixing that but it is very slow > >> progress because I have been so far enable to recreate any of the issues that > >> have been reported. I am patching "blind"... > > > > I believe your suspicious makes sense. And on these lines, that patch you > > attached earlier would fix that. However my initial tries of that didn't > > help. I'm going to run more tests and get back to you. > > Rodrigo, > > I pushed the resume-v2 branch to libata tree: > > git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/dlemoal/libata > (or https://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata.git) > > This branch adds 13 patches on top of 6.5.0 to cleanup libata suspend/resume and > other device shutdown issues. The first 4 patches are the main ones to fix > suspend resume. I tested that on 2 different machines with different drives and > with qemu. All seems fine. > > Could you try to run this through your CI ? I am very interested in seeing if it > survives your suspend/resume tests. well, in the end this didn't affect the CI machinery as I was afraid. it is only in my local DG2. https://intel-gfx-ci.01.org/tree/intel-xe/bat-all.html?testfilter=suspend (bat-dg2-oem2 one) I just got these 13 patches and applied to my branch and tested it again and it still *fails* for me. [ 79.648328] [IGT] kms_pipe_crc_basic: finished subtest pipe-A-DP-2, SUCCESS [ 79.657353] [IGT] kms_pipe_crc_basic: starting dynamic subtest pipe-B-DP-2 [ 80.375042] PM: suspend entry (deep) [ 80.380799] Filesystems sync: 0.002 seconds [ 80.386476] Freezing user space processes [ 80.392286] Freezing user space processes completed (elapsed 0.001 seconds) [ 80.399294] OOM killer disabled. [ 80.402536] Freezing remaining freezable tasks [ 80.408335] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 80.439372] sd 5:0:0:0: [sdb] Synchronizing SCSI cache [ 80.439716] serial 00:01: disabled [ 80.448011] sd 4:0:0:0: [sda] Synchronizing SCSI cache [ 80.448014] sd 7:0:0:0: [sdc] Synchronizing SCSI cache [ 80.453600] ata6.00: Entering standby power mode [ 80.464217] ata5.00: Entering standby power mode [ 80.812294] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 80.818520] ata8.00: Entering active power mode [ 80.842989] ata8.00: configured for UDMA/133 [ 80.847660] ata8.00: Entering standby power mode [ 81.119426] xe 0000:03:00.0: [drm] GT0: suspended [ 81.800508] PM: suspend of devices complete after 1367.829 msecs [ 81.806661] PM: start suspend of devices complete after 1390.859 msecs [ 81.813244] PM: suspend devices took 1.398 seconds [ 81.820101] PM: late suspend of devices complete after 2.036 msecs �[ 82.403857] serial 00:01: activated [ 82.489612] nvme nvme0: 16/0/0 default/read/poll queues [ 82.563318] r8169 0000:07:00.0 enp7s0: Link is Down [ 82.581444] xe REG[0x223a8-0x223af]: allow read access [ 82.586704] xe REG[0x1c03a8-0x1c03af]: allow read access [ 82.592071] xe REG[0x1d03a8-0x1d03af]: allow read access [ 82.597423] xe REG[0x1c83a8-0x1c83af]: allow read access [ 82.602765] xe REG[0x1d83a8-0x1d83af]: allow read access [ 82.608113] xe REG[0x1a3a8-0x1a3af]: allow read access [ 82.613281] xe REG[0x1c3a8-0x1c3af]: allow read access [ 82.618454] xe REG[0x1e3a8-0x1e3af]: allow read access [ 82.623634] xe REG[0x263a8-0x263af]: allow read access [ 82.628816] xe 0000:03:00.0: [drm] GT0: resumed [ 82.728005] ata7: SATA link down (SStatus 4 SControl 300) [ 82.733531] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 82.739773] ata5.00: Entering active power mode [ 82.744398] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 82.750618] ata6.00: Entering active power mode [ 82.755961] ata5.00: configured for UDMA/133 [ 82.760479] ata5.00: Enabling discard_zeroes_data [ 82.836266] ata6.00: configured for UDMA/133 [ 84.460081] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [ 84.466354] ata8.00: Entering active power mode [ 84.497256] ata8.00: configured for UDMA/133 ... > > If you can confirm that all issues are fixed, I will rebase this on for-next and > post. > > Thanks ! > > > -- > Damien Le Moal > Western Digital Research >