On 9/14/23 05:50, Bart Van Assche wrote:
> On 9/10/23 21:02, Damien Le Moal wrote:
>> If an error occurs when resuming a host adapter before the devices
>> attached to the adapter are resumed, the adapter low level driver may
>> remove the scsi host, resulting in a call to sd_remove() for the
>> disks of the host. However, since this function calls sd_shutdown(),
>> a synchronize cache command and a start stop unit may be issued with the
>> drive still sleeping and the HBA non-functional. This causes PM resume
>> to hang, forcing a reset of the machine to recover.
>>
>> Fix this by checking a device host state in sd_shutdown() and by
>> returning early doing nothing if the host state is not SHOST_RUNNING.
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
>> ---
>>  drivers/scsi/sd.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index c92a317ba547..a415abb721d3 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>>  	if (!sdkp)
>>  		return;         /* this can happen */
>>
>> -	if (pm_runtime_suspended(dev))
>> +	if (pm_runtime_suspended(dev) ||
>> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>>  		return;
>>
>>  	if (sdkp->WCE && sdkp->media_present) {
>
> Why to test the host state instead of dev->power.runtime_status? I don't
> think that it is safe to skip shutdown if the error handler is active.
> If the error handler can recover the device a SYNCHRONIZE CACHE command
> should be submitted.

But there is no synchronization with EH that I can see anyway. At least for
sd_remove(), I would assume that this is called only once the device
references were all dropped, so presumably EH is not doing anything with the
drive when that happens, no?

In any case, looking at dev->power.runtime_status is not correct, as this is
set to RPM_ACTIVE when the device is suspended through system suspend.

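As a rough illustration of why the runtime PM status alone does not catch
this case, here is a minimal userspace sketch. It is not kernel code:
struct mock_disk and both helper functions are hypothetical stand-ins, the
enums only borrow the kernel's names, and the host is assumed to be in
SHOST_CANCEL while it is being removed after the failed resume.

```c
#include <stdbool.h>

/* Userspace model only: these types mimic, but are not, the kernel's. */
enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };
enum shost_state { SHOST_RUNNING, SHOST_CANCEL, SHOST_DEL };

struct mock_disk {
	enum rpm_status runtime_status;	/* stands in for dev->power.runtime_status */
	enum shost_state host_state;	/* stands in for host->shost_state */
};

/* Check before the patch: only the runtime PM status is considered. */
static bool skip_shutdown_old(const struct mock_disk *d)
{
	return d->runtime_status == RPM_SUSPENDED;
}

/* Check with the patch: also return early if the host is not running. */
static bool skip_shutdown_patched(const struct mock_disk *d)
{
	return d->runtime_status == RPM_SUSPENDED ||
	       d->host_state != SHOST_RUNNING;
}
```

For a disk modeled as { RPM_ACTIVE, SHOST_CANCEL }, that is, suspended
through system suspend with its host being torn down, skip_shutdown_old()
returns false (so commands would be issued) while skip_shutdown_patched()
returns true.
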
We could replace the test "sdkp->device->host->shost_state != SHOST_RUNNING"
with "dev->power.is_suspended", as that is true (1) for a suspended device.
However, I really do not like that, as it is a PM internal field and we should
not be accessing it directly. The PM code comments say as much.

Any better ideas?

-- 
Damien Le Moal
Western Digital Research