On 24/06/21 9:31 am, Can Guo wrote:
> On 2021-06-24 14:23, Adrian Hunter wrote:
>> On 24/06/21 9:12 am, Can Guo wrote:
>>> On 2021-06-24 13:52, Adrian Hunter wrote:
>>>> On 24/06/21 5:16 am, Can Guo wrote:
>>>>> On 2021-06-23 22:30, Adrian Hunter wrote:
>>>>>> On 23/06/21 10:35 am, Can Guo wrote:
>>>>>>> To protect system suspend/resume from being disturbed by error handling,
>>>>>>> instead of using host_sem, let the error handler call lock_system_sleep() and
>>>>>>> unlock_system_sleep(), which achieve the same purpose. Remove the host_sem
>>>>>>> used in the suspend/resume paths to make the code more readable.
>>>>>>>
>>>>>>> Suggested-by: Bart Van Assche <bvanassche@xxxxxxx>
>>>>>>> Signed-off-by: Can Guo <cang@xxxxxxxxxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/scsi/ufs/ufshcd.c | 12 +++++++-----
>>>>>>>  1 file changed, 7 insertions(+), 5 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>>>>>>> index 3695dd2..a09e4a2 100644
>>>>>>> --- a/drivers/scsi/ufs/ufshcd.c
>>>>>>> +++ b/drivers/scsi/ufs/ufshcd.c
>>>>>>> @@ -5907,6 +5907,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
>>>>>>>
>>>>>>>  static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
>>>>>>>  {
>>>>>>> +	/*
>>>>>>> +	 * It is not safe to perform error handling while suspend or resume is
>>>>>>> +	 * in progress. Hence the lock_system_sleep() call.
>>>>>>> +	 */
>>>>>>> +	lock_system_sleep();
>>>>>>
>>>>>> It looks to me like the system takes this lock quite early, even before
>>>>>> freezing tasks, so if anything needs the error handler to run it will
>>>>>> deadlock.
>>>>>
>>>>> Hi Adrian,
>>>>>
>>>>> UFS/hba system suspend/resume does not invoke error handling in a
>>>>> synchronous way. So, whatever UFS error (which schedules the error handler)
>>>>> happens during suspend/resume, the error handler will just wait here till
>>>>> system suspend/resume releases the lock. Hence no worries about deadlock
>>>>> here.
>>>>
>>>> It looks to me like the state can change to UFSHCD_STATE_EH_SCHEDULED_FATAL,
>>>> and since user processes are not frozen, nor file systems synced, everything
>>>> is going to deadlock, i.e.:
>>>>   I/O is blocked waiting on error handling
>>>>   error handling is blocked waiting on lock_system_sleep()
>>>>   suspend is blocked waiting on I/O
>>>>
>>>
>>> Hi Adrian,
>>>
>>> First of all, enter_state(suspend_state_t state) uses
>>> mutex_trylock(&system_transition_mutex).
>>
>> Yes, in the case I am outlining it gets the mutex.
>>
>>> Second, even if that happens, in ufshcd_queuecommand() the logic below will
>>> break the cycle by fast-failing the PM request (the code below is from the
>>> tip with this whole series applied).
>>
>> It won't get that far because the suspend will be waiting to sync filesystems.
>> Filesystems will be waiting on I/O.
>> I/O will be waiting on the error handler.
>> The error handler will be waiting on system_transition_mutex.
>> But system_transition_mutex is already held by the PM core.
>
> Hi Adrian,
>
> You are right.... I missed the action of syncing filesystems...
>
> Going back to host_sem in suspend_prepare()/resume_complete() won't have this
> deadlock problem, right?

I am not sure, but what was the problem that the V3 patch was fixing?
Can you give an example?

>
> Thanks,
>
> Can Guo.
>
>>
>>>
>>> 	case UFSHCD_STATE_EH_SCHEDULED_FATAL:
>>> 		/*
>>> 		 * ufshcd_rpm_get_sync() is used at error handling preparation
>>> 		 * stage. If a scsi cmd, e.g., the SSU cmd, is sent from the
>>> 		 * PM ops, it can never be finished if we let the SCSI layer keep
>>> 		 * retrying it, which gets the err handler stuck forever. Neither
>>> 		 * can we let the scsi cmd pass through, because UFS is in a bad
>>> 		 * state and the scsi cmd may eventually time out, which will get
>>> 		 * the err handler blocked for too long. So, just fail the scsi cmd
>>> 		 * sent from PM ops; the err handler can recover PM errors anyway.
>>> 		 */
>>> 		if (cmd->request->rq_flags & RQF_PM) {
>>> 			hba->force_reset = true;
>>> 			set_host_byte(cmd, DID_BAD_TARGET);
>>> 			cmd->scsi_done(cmd);
>>> 			goto out;
>>> 		}
>>> 		fallthrough;
>>> 	case UFSHCD_STATE_RESET:
>>>
>>> Thanks,
>>>
>>> Can Guo.
>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Can Guo.
>>>>>
>>>>>>
>>>>>>> 	ufshcd_rpm_get_sync(hba);
>>>>>>> 	if (pm_runtime_status_suspended(&hba->sdev_ufs_device->sdev_gendev) ||
>>>>>>> 	    hba->is_wlu_sys_suspended) {
>>>>>>> @@ -5951,6 +5956,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
>>>>>>>  	ufshcd_clk_scaling_suspend(hba, false);
>>>>>>>  	ufshcd_clear_ua_wluns(hba);
>>>>>>>  	ufshcd_rpm_put(hba);
>>>>>>> +	unlock_system_sleep();
>>>>>>>  }
>>>>>>>
>>>>>>>  static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
>>>>>>> @@ -9053,16 +9059,13 @@ static int ufshcd_wl_suspend(struct device *dev)
>>>>>>>  	ktime_t start = ktime_get();
>>>>>>>
>>>>>>>  	hba = shost_priv(sdev->host);
>>>>>>> -	down(&hba->host_sem);
>>>>>>>
>>>>>>>  	if (pm_runtime_suspended(dev))
>>>>>>>  		goto out;
>>>>>>>
>>>>>>>  	ret = __ufshcd_wl_suspend(hba, UFS_SYSTEM_PM);
>>>>>>> -	if (ret) {
>>>>>>> +	if (ret)
>>>>>>>  		dev_err(&sdev->sdev_gendev, "%s failed: %d\n", __func__, ret);
>>>>>>> -		up(&hba->host_sem);
>>>>>>> -	}
>>>>>>>
>>>>>>>  out:
>>>>>>>  	if (!ret)
>>>>>>> @@ -9095,7 +9098,6 @@ static int ufshcd_wl_resume(struct device *dev)
>>>>>>>  		hba->curr_dev_pwr_mode, hba->uic_link_state);
>>>>>>>  	if (!ret)
>>>>>>>  		hba->is_wlu_sys_suspended = false;
>>>>>>> -	up(&hba->host_sem);
>>>>>>>  	return ret;
>>>>>>>  }
>>>>>>>  #endif