On 11/10/23 07:09, Phillip Susi wrote:
> Phillip Susi <phill@xxxxxxxxxxxx> writes:
>
>> I hadn't pulled in some time. Updating now.
>
> Great... the latest post 6.6 kernel hangs on entry to suspend for me.
> Bisecting it now back to the 6.6-rc5 that was previously working.

I checked again and all is fine for me on qemu and on my test systems.
With qemu, I do see this on resume:

[  168.724355] ACPI: PM: Low-level resume complete
[  168.726261] ACPI: PM: Restoring platform NVS memory
[  168.729790] ------------[ cut here ]------------
[  168.731076] WARNING: CPU: 0 PID: 938 at drivers/base/syscore.c:103 syscore_resume+0x1f9/0x230
[  168.734205] ---[ end trace 0000000000000000 ]---
[  168.736875] Enabling non-boot CPUs ...
[  168.739185] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  168.748806] CPU1 is up
...

That is new, but it may be a qemu issue.

For the real hardware, which includes a PMP box connected to an ASMedia
adapter, I get this on resume for the drives connected to the PMP:

[58290.563637] ata10.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58291.043728] ata10.01: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58291.523829] ata10.02: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58291.844544] ata10.03: SATA link down (SStatus 0 SControl 330)
[58292.164943] ata10.04: SATA link down (SStatus 0 SControl 330)
[58292.485311] ata10.05: SATA link down (SStatus 0 SControl 330)
[58292.804215] ata10.06: SATA link down (SStatus 0 SControl 330)
[58293.123028] ata10.07: SATA link down (SStatus 0 SControl 330)
[58293.443913] ata10.08: SATA link down (SStatus 0 SControl 330)
[58293.763285] ata10.09: SATA link down (SStatus 0 SControl 330)
[58295.373838] ata10.00: configured for UDMA/133
[58295.378596] ata10.00: Entering active power mode
[58305.536662] ata10.00: qc timeout after 10000 msecs (cmd 0x40)
[58305.549455] ata10.00: VERIFY failed (err_mask=0x4)
[58305.560870] ata10.01: failed to read SCR 0 (Emask=0x40)
[58305.570953] ata10.01: failed to IDENTIFY (I/O error, err_mask=0x40)
[58305.580431] ata10.01: revalidation failed (errno=-5)
[58308.047142] ata10.00: hard resetting link
[58308.529006] ata10.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58309.017039] ata10.01: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58309.505029] ata10.02: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58309.522564] ata10.00: configured for UDMA/133
[58309.541098] ata10.01: configured for UDMA/133
[58309.554900] ata10.01: Entering active power mode
[58319.872693] ata10.01: qc timeout after 10000 msecs (cmd 0x40)
[58319.883979] ata10.01: VERIFY failed (err_mask=0x4)
[58319.894481] ata10.02: failed to read SCR 0 (Emask=0x40)
[58319.904346] ata10.02: failed to IDENTIFY (I/O error, err_mask=0x40)
[58319.913584] ata10.02: revalidation failed (errno=-5)
[58321.142463] ata10.00: hard resetting link
[58321.625077] ata10.00: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58321.638685] ata10.01: hard resetting link
[58322.121098] ata10.01: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58322.609043] ata10.02: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58322.930618] ata10.03: SATA link down (SStatus 0 SControl 330)
[58323.258625] ata10.04: SATA link down (SStatus 0 SControl 330)
[58323.585871] ata10.05: SATA link down (SStatus 0 SControl 330)
[58323.914277] ata10.06: SATA link down (SStatus 0 SControl 330)
[58324.242533] ata10.07: SATA link down (SStatus 0 SControl 330)
[58324.570485] ata10.08: SATA link down (SStatus 0 SControl 330)
[58324.898665] ata10.09: SATA link down (SStatus 0 SControl 330)
[58326.545243] ata10.00: configured for UDMA/133
[58326.557807] ata10.01: configured for UDMA/133
[58326.568930] ata10.02: configured for UDMA/133
[58326.578750] ata10.02: Entering active power mode

Note the timeouts for the VERIFY commands. They happen because the disks
are slow to spin up, and the same timeouts are used as for the device
scan on boot. With a regular boot, the time from system power-up to the
time libata starts probing is enough to have the disks fully spun up,
but with a resume, that same interval is very short and the drives have
barely started spinning. So depending on the drives, the initial 10s
timeout is simply not enough. The second attempt after the hard reset
is, however, long enough for the drives to spin up and revalidation to
complete without any issue: all the drives come back.

Will patch that to increase the timeouts for the VERIFY command after a
resume, to avoid this useless hardreset.
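Roughly along the lines of the sketch below, assuming the internal
command timeout table in drivers/ata/libata-eh.c (ata_eh_cmd_timeout_table,
consulted by ata_internal_cmd_timeout() when ata_exec_internal() retries).
This is only an illustration of the idea, not the actual patch: the
ata_eh_pm_timeouts name and the values are made up here, and the element
type should match whatever the table uses in the kernel at hand:

/* Hypothetical spinup-friendly retry timeouts for the VERIFY command
 * issued on resume; illustrative values, mirroring the rationale of
 * ata_eh_reset_timeouts.
 */
static const unsigned int ata_eh_pm_timeouts[] = {
	10000,	  /* most drives spin up by 10sec */
	10000,	  /* > 99% of working drives spin up before 20sec */
	35000,	  /* give > 30 secs of idleness for outlier devices */
	 5000,	  /* and sweet one last chance */
	UINT_MAX, /* > 1 min has elapsed, give up */
};

	/* New ata_eh_cmd_timeout_table entry (ATA_EH_CMD_TIMEOUT_TABLE_SIZE
	 * would need to grow to match). ATA_CMD_VERIFY is the cmd 0x40 seen
	 * timing out in the log above.
	 */
	{ .commands = CMDS(ATA_CMD_VERIFY),
	  .timeouts = ata_eh_pm_timeouts },

With something like that, the VERIFY issued when entering active power
mode on resume gets retried with progressively longer timeouts instead
of timing out once and sending EH into the hard reset seen above.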
-- 
Damien Le Moal
Western Digital Research