Re: [PATCH] scsi: ufs: Fix deadlocks between power management and error handler

Adrian Hunter <adrian.hunter@xxxxxxxxx> · Mon, 19 Sep 2022 20:21:07 +0300

On 19/09/22 16:54, Bart Van Assche wrote:
> On 9/19/22 04:34, Adrian Hunter wrote:
>> Did you consider something like:
>>
>> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
>> index 7256e6c43ca6..dc83b38dfde9 100644
>> --- a/drivers/ufs/core/ufshcd.c
>> +++ b/drivers/ufs/core/ufshcd.c
>> @@ -7374,6 +7374,9 @@ static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
>>         hba = shost_priv(cmd->device->host);
>>   +    if (hba->pm_op_in_progress)
>> +        return FAST_IO_FAIL;
>> +
>>       spin_lock_irqsave(hba->host->host_lock, flags);
>>       hba->force_reset = true;
>>       ufshcd_schedule_eh_work(hba);
> 
> The above change could cause error handling to be skipped if an error happened that requires the link to be reset. That seems wrong to me.

Hopefully a PM op with an error that really needed a host reset
would show up as a UFS error that the error handler could fix
successfully.

Alternatively, the same change but after scheduling the error handler?

> 
>> The original commit for host_sem was aimed at sysfs (see commit below).
>> Did you consider how sysfs access is affected?
>>
>>    commit 9cd20d3f473619d8d482551d15d4cebfb3ce73c8
>>    Author: Can Guo <cang@xxxxxxxxxxxxxx>
>>    Date:   Wed Jan 13 19:13:28 2021 -0800
>>
>>      scsi: ufs: Protect PM ops and err_handler from user access through sysfs
>>           User layer may access sysfs nodes when system PM ops or error handling is
>>      running. This can cause various problems. Rename eh_sem to host_sem and use
>>      it to protect PM ops and error handling from user layer intervention.
> 
> The sysfs and debugfs attribute callback methods already call pm_runtime_get_sync() and pm_runtime_put_sync() so how could the power state change while a sysfs or debugfs attribute callback method is in progress?

Without PM holding host_sem, maybe it would give a similar
deadlock to what was described:

ufs_sysfs_read_desc_param
down(&hba->host_sem); <------------------------------------
ufshcd_rpm_get_sync(hba);
	waits for blk_execute_rq()
	waits for ufshcd_eh_host_reset_handler()
	waits for ufshcd_err_handler()
	waits for down(&hba->host_sem); <------------------

> 
>>> The ufshcd_rpm_get_sync() call at the start of
>>> ufshcd_err_handling_prepare() may deadlock since calling scsi_execute()
>>> is required by the UFS runtime resume implementation. Fixing that
>>> deadlock falls outside the scope of this patch.
>>
>> Do you mean:
>>
>> static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
>> {
>>     ufshcd_rpm_get_sync(hba);
>>
>> because that is the host controller, not the UFS device, that is
>> being resumed.
> 
> Hmm ... I think that ufshcd_rpm_get_sync() affects the power state of the UFS device and not the power state of the UFS host controller. From ufshcd-priv.h:
> 
> static inline int ufshcd_rpm_get_sync(struct ufs_hba *hba)
> {
>     return pm_runtime_get_sync(&hba->ufs_device_wlun->sdev_gendev);
> }

Yes, I misread that, sorry.

I guess it goes unnoticed because it is very unlikely i.e. the UFS
device would need to be suspending but not yet have claimed host_sem.
There would not be any outstanding requests otherwise the suspend
would not have started, so chance of errors at that point is very low.

Maybe deadlock could be sidestepped by changing:

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 7256e6c43ca6..9cb04c6f8dc3 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -9258,7 +9261,10 @@ static int ufshcd_wl_suspend(struct device *dev)
 	ktime_t start = ktime_get();
 
 	hba = shost_priv(sdev->host);
-	down(&hba->host_sem);
+	if (down_trylock(&hba->host_sem)) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	if (pm_runtime_suspended(dev))
 		goto out;