[Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16

bugzilla-daemon@xxxxxxxxxx · Fri, 26 Aug 2022 07:00:38 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #39 from damien.lemoal@xxxxxxxxxxxxxxxxxx ---
On 8/26/22 07:15, bugzilla-daemon@xxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #38 from damien.lemoal@xxxxxxxxxxxxxxxxxx ---
> On 8/26/22 05:31, bugzilla-daemon@xxxxxxxxxx wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>
>> --- Comment #37 from Bart Van Assche (bvanassche@xxxxxxx) ---
>> On 8/25/22 13:01, bugzilla-daemon@xxxxxxxxxx wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>>
>>> --- Comment #36 from jason600 (jason600.groome@xxxxxxxxx) ---
>>> (In reply to Bart Van Assche from comment #27)
>>>> Thanks for testing! The patches from the sd-resume branch have been posted
>>>> on the linux-scsi mailing list. See also
>>>>
>>>>
>>>>
>>>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@xxxxxxx/
>>>> T/#t
>>>
>>> Hi Bart, just an update for you. I noticed this had been removed from the
>>> 6.0-rc1 for freezing after suspend.
>>>
>>> I've been compiling my kernel with this fix on various 5.18 kernels (with
>>> opensuse tumbleweed), it has worked fine, no freezing on resume as others
>>> have
>>> mentioned.
>>>
>>> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
>>> froze after the first suspend. Rebooted and the same thing happened again.
>>> I
>>> recompiled the kernel with the fix, just to make sure i didn't mess it up,
>>> and
>>> the same happened again.
>>>
>>> When you originally did this fix, you based it on 5.18, and indeed, it
>>> works
>>> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c
>>> file
>>> for 5.19, presumably it was those changes that made this fix start freezing
>>> after suspend.
>>>
>>> Perhaps you could check if the other people that experienced freezing were
>>> using either 5.19 or 6.0-rc1.
>>
>> Multiple people reported issues with freezes during suspend with kernel 
>> v6.0-rc1. Please take a look at the following report: 
>>
>> https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@xxxxxxxxxx/. 
>> It shows that if zoned ATA disks are present that blk_mq_freeze_queue() 
>> may be called from inside ata_scsi_dev_rescan() on the context of a work 
>> queue. ATA rescanning happens from inside the SCSI error handler. So 
>> there is potential for a lockup because of the following:
>> * Execution of the START command being postponed because the SCSI error 
>> handler is active.
>> * blk_mq_freeze_queue() waiting for the START command to finish.
>> * The START completion handler not being executed because it got queued 
>> on the same work queue as the ATA rescan work.

Checking the code, the dev pm resume call chain for ATA look like this.

ata_port_resume() -> ata_port_request_pm() -> ata_port_schedule_eh() ->
scsi_error_handler() thread runs -> shost->transportt->eh_strategy_handler
== ata_scsi_error() -> ata_scsi_port_error_handler() ->
ata_eh_handle_port_resume() -> ap->ops->port_resume() ==
ahci_port_resume() for AHCI adapters.

There are no commands issued to the device by this chain, only
registers/port settings being changed. So this should always complete
quickly and in itself not be the reason for the START_STOP command issued
by the sd driver to get stuck.

After calling ata_eh_handle_port_resume(), ata_scsi_port_error_handler()
will trigger a reset through ata_std_error_handler() -> ata_do_eh() and
that will cause the device rescan in EH context. Device rescan will
definitely spin up the device so the START_STOP command that sd_resume()
issues seems rather useless...

As a hack, it would be good to try something like this:

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 29e2f55c6faa..1bc92c04f048 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1081,6 +1081,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev,
struct ata_device *dev)
        } else {
                sdev->sector_size = ata_id_logical_sector_size(dev->id);
                sdev->manage_start_stop = 1;
+               sdev->no_start_on_resume = 1;
        }

        /*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8f79fa6318fe..4c28ca4d038b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3761,7 +3761,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
        struct scsi_disk *sdkp = dev_get_drvdata(dev);
-       int ret;
+       int ret = 0;

        if (!sdkp)      /* E.g.: runtime resume at the start of sd_probe() */
                return 0;
@@ -3770,7 +3770,8 @@ static int sd_resume(struct device *dev)
                return 0;

        sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-       ret = sd_start_stop_device(sdkp, 1);
+       if (!sdkp->device->no_start_on_resume)
+               ret = sd_start_stop_device(sdkp, 1);
        if (!ret)
                opal_unlock_from_suspend(sdkp->opal_dev);
        return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 3113471ca375..92e141536c6c 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -192,6 +192,7 @@ struct scsi_device {
        unsigned no_start_on_add:1;     /* do not issue start on add */
        unsigned allow_restart:1; /* issue START_UNIT in error handler */
        unsigned manage_start_stop:1;   /* Let HLD (sd) manage start/stop */
+       unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on
resume */
        unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT
*/
        unsigned no_uld_attach:1; /* disable connecting to upper level drivers
*/
        unsigned select_no_atn:1;

I am not sure at all this is correct though. This actually may break other
suspend/resume flavors. If we could somehow synchronize scsi pm resume to
run *after* ata pm resume, all problems should go away I think.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.