On 6/16/23 12:32, Kai-Heng Feng wrote: > On Thu, Jun 15, 2023 at 10:50 PM Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> wrote: >> >> On Thu, Jun 15, 2023 at 05:33:26PM +0900, Damien Le Moal wrote: >>> When an ATA port is resumed from sleep, the port is reset and a power >>> management request issued to libata EH to reset the port and rescanning >>> the device(s) attached to the port. Device rescanning is done by >>> scheduling an ata_scsi_dev_rescan() work, which will execute >>> scsi_rescan_device(). >>> >>> However, scsi_rescan_device() takes the generic device lock, which is >>> also taken by dpm_resume() when the SCSI device is resumed as well. If >>> a device rescan execution starts before the completion of the SCSI >>> device resume, the rcu locking used to refresh the cached VPD pages of >>> the device, combined with the generic device locking from >>> scsi_rescan_device() and from dpm_resume() can cause a deadlock. >>> >>> Avoid this situation by changing struct ata_port scsi_rescan_task to be >>> a delayed work instead of a simple work_struct. ata_scsi_dev_rescan() is >>> modified to check if the SCSI device associated with the ATA device that >>> must be rescanned is not suspended. If the SCSI device is still >>> suspended, ata_scsi_dev_rescan() returns early and reschedule itself for >>> execution after an arbitrary delay of 5ms. >> >> I don't understand the nature of the relationship between the ATA port >> and the corresponding SCSI device. Maybe you could explain it more >> fully, if you have time. >> >> But in any case, this approach seems like a layering violation. Why not >> instead call a SCSI utility routine to set a "needs_rescan" flag in the >> scsi_device structure? Then scsi_device_resume() could automatically >> call scsi_rescan_device() -- or rather an internal version that assumes >> the device lock is already held -- if the flag is set. Or it could >> queue a non-delayed work routine to do this. (Is it important to have >> the rescan finish before userspace starts up and tries to access the ATA >> device again?) >> >> That, combined with a guaranteed order of resuming, would do what you >> want, right? > > What you are suggesting is pretty much like my previous approach: > https://lore.kernel.org/all/20230502150435.423770-2-kai.heng.feng@xxxxxxxxxxxxx/ Not really. We need more than what you did. See my reply to Alan. Your solution is rather similar to what I did but it was delaying the rescan after the entire system is resumed (pm_suspend_target_state != PM_SUSPEND_ON), which is really a heavy hammer and would significantly slow down resuming. > > Kai-Heng > >> >> Alan Stern -- Damien Le Moal Western Digital Research