On Mon, 2023-09-25 at 15:07 -0700, Bart Van Assche wrote:
> On 9/25/23 01:51, Peter Wang (王信友) wrote:
> > On Fri, 2023-09-22 at 09:06 -0700, Bart Van Assche wrote:
> >> On 9/22/23 02:09, peter.wang@xxxxxxxxxxxx wrote:
> >> > When a runtime PM SSU command times out, the SCSI core invokes
> >> > eh_host_reset_handler, which hooks function ufshcd_eh_host_reset_handler.
> >> > That function schedules eh_work and gets stuck waiting in
> >> > flush_work(&hba->eh_work), while ufshcd_err_handler hangs waiting for
> >> > runtime PM resume. Do link recovery only in this case.
> >> > Below is the IO hang stack dump in kernel-6.1.
> >>
> >> What does kernel-6.1 mean? Has commit 7029e2151a7c ("scsi: ufs: Fix a
> >> deadlock between PM and the SCSI error handler") been backported to
> >> that kernel?
> >
> > Yes, our kernel-6.1 has the backport of 7029e2151a7c ("scsi: ufs: Fix a
> > deadlock between PM and the SCSI error handler").
>
> Hi Peter,
>
> Thanks for having confirmed this.
>
> Please use text mode instead of HTML mode when replying. As one can see
> here, your reply did not make it to the linux-scsi mailing list:
> https://lore.kernel.org/linux-scsi/20230922090925.16339-1-peter.wang@xxxxxxxxxxxx/
>
> > And this IO hang issue still happens.
> >
> > This issue is easy to trigger if we drop the SSU response to let the
> > SSU command time out.
> >
> >> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> >> > index c2df07545f96..7608d75bb4fe 100644
> >> > --- a/drivers/ufs/core/ufshcd.c
> >> > +++ b/drivers/ufs/core/ufshcd.c
> >> > @@ -7713,9 +7713,29 @@ static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> >> >          int err = SUCCESS;
> >> >          unsigned long flags;
> >> >          struct ufs_hba *hba;
> >> > +        struct device *dev;
> >> >
> >> >          hba = shost_priv(cmd->device->host);
> >> >
> >> > +        /*
> >> > +         * If runtime PM sends SSU and it times out, scsi_error_handler
> >> > +         * gets stuck in this function waiting for flush_work(&hba->eh_work),
> >> > +         * while ufshcd_err_handler (eh_work) is stuck waiting for runtime
> >> > +         * PM to become active. Doing ufshcd_link_recovery instead of
> >> > +         * scheduling eh_work prevents this deadlock.
> >> > +         */
> >> > +        dev = &hba->ufs_device_wlun->sdev_gendev;
> >> > +        if ((dev->power.runtime_status == RPM_RESUMING) ||
> >> > +            (dev->power.runtime_status == RPM_SUSPENDING)) {
> >> > +                err = ufshcd_link_recovery(hba);
> >> > +                if (err) {
> >> > +                        dev_err(hba->dev, "WL Device PM: status:%d, err:%d\n",
> >> > +                                dev->power.runtime_status,
> >> > +                                dev->power.runtime_error);
> >> > +                }
> >> > +                return err;
> >> > +        }
> >> > +
> >> >          spin_lock_irqsave(hba->host->host_lock, flags);
> >> >          hba->force_reset = true;
> >> >          ufshcd_schedule_eh_work(hba);
> >>
> >> I think this change is racy because the runtime power management status
> >> may change after the above checks have been performed and before
> >> ufshcd_err_handling_prepare() is called.
> >
> > If this function is invoked while the RPM state is RPM_RESUMING or
> > RPM_SUSPENDING, it means an error happened in the UFS driver
> > resume/suspend callback, and there is no chance to move to the next
> > state while that callback has not finished, just like the backtrace
> > stuck in sending the SSU command.
> > Or we can make it simpler by checking pm_op_in_progress, what do you
> > think?
>
> I'm concerned that this approach will fix some cases but not all cases. The
> UFS error handler (ufshcd_err_handler()) and runtime suspend or resume code
> may be called concurrently.
> No matter which checks are added at the start of
> ufshcd_eh_host_reset_handler(), I think these checks won't protect against
> concurrent execution of runtime power management code just after these
> checks have finished.
>

Hi Bart,

Yes, but ufshcd_err_handler waits for runtime PM resume to become active
if it runs concurrently with runtime suspend or resume
(ufshcd_err_handler -> ufshcd_err_handling_prepare -> ufshcd_rpm_get_sync).
So, if runtime suspend or resume sends SSU and it times out, the SCSI error
handler gets stuck in this function and the deadlock happens.
The idea is: if the runtime PM flow hits an error, doing ufshcd_link_recovery
is enough (see the sketch at the end of this mail).

> > Since the hang backtrace does not involve the function below, it does
> > not help.
>
> I think that conclusion is wrong. Can you please test the patch that
> I provided?

Yes, I have checked that this patch is included in our code base, and the
IO hang still happens. The stack dump of this deadlock shows how it causes
the IO hang.

Thanks.

> Thanks,
>
> Bart.
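
For reference, below is a minimal sketch of the simpler variant mentioned
above, i.e. checking hba->pm_op_in_progress instead of the wlun's
runtime_status. It only illustrates the idea, it is not the posted patch:
the surrounding code follows the current shape of
ufshcd_eh_host_reset_handler(), and mapping a link recovery failure to
FAILED (instead of returning the ufshcd_link_recovery() error code as the
posted patch does) is my adaptation and may need adjustment.

static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
{
	struct ufs_hba *hba = shost_priv(cmd->device->host);
	unsigned long flags;
	int err = SUCCESS;

	/*
	 * A runtime suspend/resume callback is still running, e.g. the SSU
	 * command it sent has timed out. Scheduling eh_work here would
	 * deadlock: eh_work waits for the device wlun to become RPM_ACTIVE
	 * while the PM callback waits for the SCSI error handler. Recover
	 * the link directly instead.
	 */
	if (hba->pm_op_in_progress) {
		if (ufshcd_link_recovery(hba))
			err = FAILED;
		return err;
	}

	spin_lock_irqsave(hba->host->host_lock, flags);
	hba->force_reset = true;
	ufshcd_schedule_eh_work(hba);
	spin_unlock_irqrestore(hba->host->host_lock, flags);

	flush_work(&hba->eh_work);

	spin_lock_irqsave(hba->host->host_lock, flags);
	if (hba->ufshcd_state == UFSHCD_STATE_ERROR)
		err = FAILED;
	spin_unlock_irqrestore(hba->host->host_lock, flags);

	return err;
}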