Re: [PATCH] ata: libata-scsi: Avoid deadlock on rescan after device resume

Alan Stern <stern@xxxxxxxxxxxxxxxxxxx> · Fri, 16 Jun 2023 10:25:22 -0400

On Fri, Jun 16, 2023 at 02:25:38PM +0900, Damien Le Moal wrote:
> On 6/15/23 23:50, Alan Stern wrote:
> > On Thu, Jun 15, 2023 at 05:33:26PM +0900, Damien Le Moal wrote:
> >> When an ATA port is resumed from sleep, the port is reset and a power
> >> management request issued to libata EH to reset the port and rescanning
> >> the device(s) attached to the port. Device rescanning is done by
> >> scheduling an ata_scsi_dev_rescan() work, which will execute
> >> scsi_rescan_device().
> >>
> >> However, scsi_rescan_device() takes the generic device lock, which is
> >> also taken by dpm_resume() when the SCSI device is resumed as well. If
> >> a device rescan execution starts before the completion of the SCSI
> >> device resume, the rcu locking used to refresh the cached VPD pages of
> >> the device, combined with the generic device locking from
> >> scsi_rescan_device() and from dpm_resume() can cause a deadlock.
> >>
> >> Avoid this situation by changing struct ata_port scsi_rescan_task to be
> >> a delayed work instead of a simple work_struct. ata_scsi_dev_rescan() is
> >> modified to check if the SCSI device associated with the ATA device that
> >> must be rescanned is not suspended. If the SCSI device is still
> >> suspended, ata_scsi_dev_rescan() returns early and reschedule itself for
> >> execution after an arbitrary delay of 5ms.
> > 
> > I don't understand the nature of the relationship between the ATA port
> > and the corresponding SCSI device.  Maybe you could explain it more
> > fully, if you have time.
> 
> For ata devices, the parent -> child relationship is:
> 
> ata_host (the adapter) -> ata_port -> ata_link -> ata_device (HDD, SSD or ATAPI
> CD/DVD)
> 
> For scsi devices representing ATA devices, it is:
> 
> ata_port -> scsi_host -> scsi_target -> scsi_device -> scsi_disk (or gendisk for
> a CD/DVD)
> 
> When devices are scanned, libata will create ports and create a scsi_host for
> each port, and a scsi device for each ata_device found on the link(s) for the
> port. There is no direct relationship between an ata_device (the HDD or SSD) and
> its scsi_device/scsi_disk (the device used to issue commands). The PM operations
> we have are for ata_port and scsi_device. For the scsi device, the operations
> are actually defined per device type, so in the scsi_disk driver (sd) for HDDs
> and SSDs.

Thanks for the explanation.  So there are two device structures in the 
kernel representing the same physical piece of hardware.  No wonder you 
sometimes run into trouble!

On a more positive note, it sounds like each ata_device is always 
created and registered before the corresponding scsi_device, which means 
that synchronous suspends and resumes are guaranteed to work the way you 
want.

> > But in any case, this approach seems like a layering violation.  Why not 
> 
> The layering violation is I think only with the direct reference to the scsi
> device power.is_suspended field, which is definitely not pretty. But there are
> some other drivers doing something similar:
> 
> $ git grep "power\.is_suspended" | grep -v drivers/base/power/main.c
> drivers/gpu/drm/i915/display/intel_display_power_well.c:	if
> (!dev_priv->drm.dev->power.is_suspended)
> drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c:	if
> (!dwmac->dev->power.is_suspended) {
> drivers/platform/surface/surface_acpi_notify.c:	if (d->dev->power.is_suspended) {
> 
> All the other code (ata calling scsi) is normal per the SCSI-to-ata translation
> needed (all ata devices are represented as scsi devices in Linux, following the
> SAT=scsi ATA translation specifications).
> 
> > instead call a SCSI utility routine to set a "needs_rescan" flag in the 
> > scsi_device structure?  Then scsi_device_resume() could automatically 
> > call scsi_rescan_device() -- or rather an internal version that assumes 
> > the device lock is already held -- if the flag is set.  Or it could 
> 
> Yes, ideally, that is what we should do. Such fix is however more involved, and
> so I prefer not to push for this right now as a fix for the regression at hand.
> But I will definitively look into this.
> 
> > queue a non-delayed work routine to do this.  (Is it important to have 
> > the rescan finish before userspace starts up and tries to access the ATA 
> > device again?)
> > 
> > That, combined with a guaranteed order of resuming, would do what you 
> > want, right?
> 
> Yes. But again more fixes needed:
> 1) libata uses its error handling path to reset a port on resume and probe the
> links again. The devices found are then revalidated (IDENTIFY command, reading
> log pages etc). This call to EH is triggered in the pm->resume operation, but EH
> runs as an asynchronous task. So the port->resume may complete before EH is
> done. We need to change the EH call to be synchronous in ->resume
> 2) We need to remove triggering the task that does scsi_rescan_device() in EH
> and move that trigger to scsi_device ->resume.
> 3) Modify scsi_device ->resume() to call scsi_rescan_device()
> 
> Safely doing (3) requires synchronization between ata_port->resume and
> scsi_device->resume. We can do that by adding a device link between ata_device
> and scsi_device. Doing so, the scsi device becomes the grandchild of the
> ata_port and we are guaranteed that its ->resume will happen only once the ata
> port ->resume is done.

That change should be made in any case.  SCSI device drivers assume they 
can send PM-related commands to the device as part of carrying out the 
resume procedure.  (I don't remember whether they actually do so, but 
they are allowed to make this assumption.)  Obviously such commands 
won't work unless the ata_port is fully resumed first.

> That will also improve things as we will be able to rescan the scsi device (and
> thus catch any change to the device) *before* the device ->resume operation
> re-enables issuing user commands by un-quiescing the device request queue.

Just so.

> As you can see, that is all beyond a quick fix for a regression... Will work on
> this.

It seems like a good plan of approach.

Alan Stern