Re: [PATCH v8 04/23] scsi: sd: Differentiate system and runtime start/stop management

Damien Le Moal <dlemoal@xxxxxxxxxx> · Mon, 23 Oct 2023 14:51:58 +0900

On 10/21/23 06:23, Phillip Susi wrote:
> Damien Le Moal <dlemoal@xxxxxxxxxx> writes:
> 
>> On my system, I see:
>>
>> cat /sys/class/ata_port/ata1/power/runtime_active_kids
>> 0
> 
> I see a 1 there, which is the single scsi_host.  The scsi_host has 2
> active kids; the two disks.  When I enabled runtime pm, only when the
> second disk was suspended did that allow the scsi_host to suspend, which
> then allowed the port to suspend.  Everything looked fine there so far.
> Then I tried:
> 
> echo 1 > /sys/block/sdf/device/delete
> 
> And the SCSI EH appears to have tried to wake up the disk, and hung in
> the process.
> 
> [  314.246282] sd 7:0:0:0: [sde] Synchronizing SCSI cache
> [  314.246445] sd 7:0:0:0: [sde] Stopping disk
> 
> First disk suspends.
> 
> [  388.518295] sd 7:1:0:0: [sdf] Synchronizing SCSI cache
> [  388.518519] sd 7:1:0:0: [sdf] Stopping disk
> 
> Second disk suspends some time later.
> 
> [  388.930428] ata8.00: Entering standby power mode
> [  389.330651] ata8.01: Entering standby power mode
> 
> That allowed the port to suspend.  This is when I tried to detach the
> disk driver, which I think tried to resume the disk before detaching,
> which resumed the port.
> 
> [  467.511878] ata8.15: SATA link down (SStatus 0 SControl 310)
> [  468.142726] ata8.15: failed to read PMP GSCR[0] (Emask=0x100)
> [  468.142741] ata8.15: PMP revalidation failed (errno=-5)
> 
> I ran hdparm -C on the other disk at this point.  I just noticed that
> the ata8.15 that represents the PMP itself was NOT suspended along with
> the two drive links, and then maybe was not resumed before trying to
> revalidate the PMP?  And that's why it failed?
> 
> [  473.172792] ata8.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [  473.486860] ata8.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [  473.802139] ata8.01: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> 
> It seems like it ended up recovering here though?  And yet the scsi_eh
> remained hung, as did the hdparm -C:
> 
> [  605.566814] INFO: task scsi_eh_7:173 blocked for more than 120 seconds.
> [  605.566829]       Not tainted 6.6.0-rc5+ #5
> [  605.566834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  605.566838] task:scsi_eh_7       state:D stack:0     pid:173   ppid:2      flags:0x00004000
> [  605.566850] Call Trace:
> [  605.566853]  <TASK>
> [  605.566860]  __schedule+0x37c/0xb70
> [  605.566878]  schedule+0x61/0xd0
> [  605.566888]  rpm_resume+0x156/0x760

This looks like a deadlock: scsi_eh_7 is blocked in rpm_resume(), likely
against the device removal that you triggered with the
"echo 1 > /sys/block/sdf/device/delete".

Can you send the exact list of commands & events you executed to get to that
point? Also please share your kernel config.
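
If it helps, something along these lines could capture the runtime PM state
before and after each step. This is only a sketch: dump_rpm is a hypothetical
helper name, the device paths in the example call are assumed from your logs,
and runtime_active_kids only shows up with CONFIG_PM_ADVANCED_DEBUG enabled.

```shell
#!/bin/sh
# Sketch: print selected runtime PM attributes for each sysfs device
# directory given as an argument. Attributes that are missing or
# unreadable are skipped silently.
dump_rpm() {
    for dev in "$@"; do
        for attr in control runtime_status runtime_active_kids; do
            f="$dev/power/$attr"
            if [ -r "$f" ]; then
                printf '%s: %s\n' "$f" "$(cat "$f")"
            fi
        done
    done
}

# Example call (paths assumed from the logs in this thread):
#   dump_rpm /sys/class/ata_port/ata8 \
#            /sys/block/sde/device /sys/block/sdf/device
#
# The kernel config is usually available from one of these, depending
# on the distribution and on CONFIG_IKCONFIG_PROC:
#   zcat /proc/config.gz
#   cat /boot/config-$(uname -r)
```

Running the example call before enabling runtime PM, after each disk
suspends, and right before the delete would give a usable timeline.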

-- 
Damien Le Moal
Western Digital Research