On Wed, Jun 14, 2023 at 4:26 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: > > On Wed, Jun 14, 2023 at 04:35:50PM +0900, Damien Le Moal wrote: > > On 6/14/23 15:57, Hannes Reinecke wrote: > > > On 6/14/23 06:49, Damien Le Moal wrote: > > >> On 6/11/23 18:05, Joe Breuer wrote: > > >>> I'm the reporter of this issue. > > >>> > > >>> I just tried this patch against 6.3.4, and it completely fixes my > > >>> suspend/resume issue. > > >>> > > >>> The optical drive stays usable after resume, even suspending/resuming > > >>> during playback of CDDA content works flawlessly and playback resumes > > >>> seamlessly after system resume. > > >>> > > >>> So, from my perspective: Good one! > > >> > > >> In place of Bart's fix, could you please try this patch ? > > >> > > >> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c > > >> index b80e68000dd3..a81eb4f882ab 100644 > > >> --- a/drivers/ata/libata-eh.c > > >> +++ b/drivers/ata/libata-eh.c > > >> @@ -4006,9 +4006,32 @@ static void ata_eh_handle_port_resume(struct > > >> ata_port *ap) > > >> /* tell ACPI that we're resuming */ > > >> ata_acpi_on_resume(ap); > > >> > > >> - /* update the flags */ > > >> spin_lock_irqsave(ap->lock, flags); > > >> + > > >> + /* Update the flags */ > > >> ap->pflags &= ~(ATA_PFLAG_PM_PENDING | ATA_PFLAG_SUSPENDED); > > >> + > > >> + /* > > >> + * Resuming the port will trigger a rescan of the ATA device(s) > > >> + * connected to it. Before scheduling the rescan, make sure that > > >> + * the associated scsi device(s) are fully resumed as well. > > >> + */ > > >> + ata_for_each_link(link, ap, HOST_FIRST) { > > >> + ata_for_each_dev(dev, link, ENABLED) { > > >> + struct scsi_device *sdev = dev->sdev; > > >> + > > >> + if (!sdev) > > >> + continue; > > >> + if (scsi_device_get(sdev)) > > >> + continue; > > >> + > > >> + spin_unlock_irqrestore(ap->lock, flags); > > >> + device_pm_wait_for_dev(&ap->tdev, > > >> + &sdev->sdev_gendev); > > >> + scsi_device_put(sdev); > > >> + spin_lock_irqsave(ap->lock, flags); > > >> + } > > >> + } > > >> spin_unlock_irqrestore(ap->lock, flags); > > >> } > > >> #endif /* CONFIG_PM */ > > >> > > >> Thanks ! > > >> > > > Well; not sure if that'll work out. > > > The whole reason why we initial a rescan is that we need to check if the > > > ports are still connected, and whether the devices react. > > > So we can't iterate the ports here as this is the very thing which gets > > > checked during EH. > > > > Hmmm... Right. So we need to move that loop into ata_scsi_dev_rescan(), > > which itself already loops over the port devices anyway. > > > > > We really should claim resume to be finished as soon as we can talk with > > > the HBA, and kick off EH asynchronously to let it finish the job after > > > resume has completed. > > > > That is what's done already: > > > > static int ata_port_pm_resume(struct device *dev) > > { > > ata_port_resume_async(to_ata_port(dev), PMSG_RESUME); > > pm_runtime_disable(dev); > > pm_runtime_set_active(dev); > > pm_runtime_enable(dev); > > return 0; > > } > > > > EH is kicked by ata_port_resume_async() -> ata_port_request_pm() and it > > is async. There is no synchronization in EH with the PM side though. We > > probably should have EH check that the port resume is done first, which > > can be done in ata_eh_handle_port_resume() since that is the first thing > > done when entering EH. > > > > The problem remains though that we *must* wait for the scsi device > > resume to be done before calling scsi_rescan_device(), which is done > > asynchronously from EH, as a different work. So that one needs to wait > > for the scsi side resume to be done. > > > > I also thought of trigerring the rescan from the scsi side, but since > > the resume may be asynchronous, we could endup trigerring it with the > > ata side not yet resumed... That would only turn the problem around > > instead of solving it. > > The order in which devices get resumed isn't arbitrary. If the system > is set up not to use async suspends/resumes then the order is always the > same as the order in which the devices were originally registered (for > resume, that is -- suspend obviously takes place in the reverse order). > > So if you're trying to perform an action that requires two devices to be > active, you must not do it in the resume handler for the device that was > registered first. I don't know how the ATA and SCSI pieces interact > here, but regardless, this is a pretty strict requirement. > > It should be okay to perform the action in the resume handler for the > device that was registered second. But if the two devices aren't in an > ancestor-descendant relationship then you also have to call > device_pm_wait_for_dev() (or use device links as Rafael mentioned) to > handle the async case properly. Note that the bare device_pm_wait_for_dev() is a bit risky though, because in the sync case it will deadlock if dpm_list is not ordered properly. One of the things taken care of by device_link_add() is to ensure that the ordering of dpm_list will reflect the dependency represented by the given new device link. > > Or... Why the heck scsi_rescan_device() is calling device_lock() ? This > > is the only place in scsi code I can see that takes this lock. I suspect > > this is to serialize either rescans, or serialize with resume, or both. > > For serializing rescans, we can use another lock. For serializing with > > PM, we should wait for PM transitions... > > Something is not right here. > > Here's what commit e27829dc92e5 ("scsi: serialize ->rescan against > ->remove", written by Christoph Hellwig) says: > > Lock the device embedded in the scsi_device to protect against > concurrent calls to ->remove. > > That's the commit which added the device_lock() call. > > Alan Stern