On 8/26/22 07:15, bugzilla-daemon@xxxxxxxxxx wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=215880 > > --- Comment #38 from damien.lemoal@xxxxxxxxxxxxxxxxxx --- > On 8/26/22 05:31, bugzilla-daemon@xxxxxxxxxx wrote: >> https://bugzilla.kernel.org/show_bug.cgi?id=215880 >> >> --- Comment #37 from Bart Van Assche (bvanassche@xxxxxxx) --- >> On 8/25/22 13:01, bugzilla-daemon@xxxxxxxxxx wrote: >>> https://bugzilla.kernel.org/show_bug.cgi?id=215880 >>> >>> --- Comment #36 from jason600 (jason600.groome@xxxxxxxxx) --- >>> (In reply to Bart Van Assche from comment #27) >>>> Thanks for testing! The patches from the sd-resume branch have been posted >>>> on the linux-scsi mailing list. See also >>>> >>>> >>>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@xxxxxxx/ >>>> T/#t >>> >>> Hi Bart, just an update for you. I noticed this had been removed from the >>> 6.0-rc1 for freezing after suspend. >>> >>> I've been compiling my kernel with this fix on various 5.18 kernels (with >>> opensuse tumbleweed), it has worked fine, no freezing on resume as others >>> have >>> mentioned. >>> >>> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it >>> froze after the first suspend. Rebooted and the same thing happened again. I >>> recompiled the kernel with the fix, just to make sure i didn't mess it up, >>> and >>> the same happened again. >>> >>> When you originally did this fix, you based it on 5.18, and indeed, it works >>> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c >>> file >>> for 5.19, presumably it was those changes that made this fix start freezing >>> after suspend. >>> >>> Perhaps you could check if the other people that experienced freezing were >>> using either 5.19 or 6.0-rc1. >> >> Multiple people reported issues with freezes during suspend with kernel >> v6.0-rc1. Please take a look at the following report: >> https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@xxxxxxxxxx/. >> It shows that if zoned ATA disks are present that blk_mq_freeze_queue() >> may be called from inside ata_scsi_dev_rescan() on the context of a work >> queue. ATA rescanning happens from inside the SCSI error handler. So >> there is potential for a lockup because of the following: >> * Execution of the START command being postponed because the SCSI error >> handler is active. >> * blk_mq_freeze_queue() waiting for the START command to finish. >> * The START completion handler not being executed because it got queued >> on the same work queue as the ATA rescan work. Checking the code, the dev pm resume call chain for ATA look like this. ata_port_resume() -> ata_port_request_pm() -> ata_port_schedule_eh() -> scsi_error_handler() thread runs -> shost->transportt->eh_strategy_handler == ata_scsi_error() -> ata_scsi_port_error_handler() -> ata_eh_handle_port_resume() -> ap->ops->port_resume() == ahci_port_resume() for AHCI adapters. There are no commands issued to the device by this chain, only registers/port settings being changed. So this should always complete quickly and in itself not be the reason for the START_STOP command issued by the sd driver to get stuck. After calling ata_eh_handle_port_resume(), ata_scsi_port_error_handler() will trigger a reset through ata_std_error_handler() -> ata_do_eh() and that will cause the device rescan in EH context. Device rescan will definitely spin up the device so the START_STOP command that sd_resume() issues seems rather useless... As a hack, it would be good to try something like this: diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index 29e2f55c6faa..1bc92c04f048 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -1081,6 +1081,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev) } else { sdev->sector_size = ata_id_logical_sector_size(dev->id); sdev->manage_start_stop = 1; + sdev->no_start_on_resume = 1; } /* diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index 8f79fa6318fe..4c28ca4d038b 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3761,7 +3761,7 @@ static int sd_suspend_runtime(struct device *dev) static int sd_resume(struct device *dev) { struct scsi_disk *sdkp = dev_get_drvdata(dev); - int ret; + int ret = 0; if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */ return 0; @@ -3770,7 +3770,8 @@ static int sd_resume(struct device *dev) return 0; sd_printk(KERN_NOTICE, sdkp, "Starting disk\n"); - ret = sd_start_stop_device(sdkp, 1); + if (!sdkp->device->no_start_on_resume) + ret = sd_start_stop_device(sdkp, 1); if (!ret) opal_unlock_from_suspend(sdkp->opal_dev); return ret; diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 3113471ca375..92e141536c6c 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -192,6 +192,7 @@ struct scsi_device { unsigned no_start_on_add:1; /* do not issue start on add */ unsigned allow_restart:1; /* issue START_UNIT in error handler */ unsigned manage_start_stop:1; /* Let HLD (sd) manage start/stop */ + unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */ unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT */ unsigned no_uld_attach:1; /* disable connecting to upper level drivers */ unsigned select_no_atn:1; I am not sure at all this is correct though. This actually may break other suspend/resume flavors. If we could somehow synchronize scsi pm resume to run *after* ata pm resume, all problems should go away I think. -- Damien Le Moal Western Digital Research