Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Can Guo <cang@xxxxxxxxxxxxxx> · Tue, 15 Jun 2021 10:36:56 +0800

Hi Bart,

On 2021-06-15 02:49, Bart Van Assche wrote:
On 6/13/21 7:42 AM, Can Guo wrote:
2. ufshcd_abort() invokes ufshcd_err_handler() synchronously can have 
a
live lock issue, which is why I chose the asynchronous way (from the 
first
day I started to fix error handling). The live lock happens when abort
happens
to a PM request, e.g., a SSU cmd sent from suspend/resume. Because UFS
error
handler is synchronized with suspend/resume (by calling
pm_runtime_get_sync()
and lock_system_sleep()), the sequence is like:
[1] ufshcd_wl_resume() sends SSU cmd
[2] ufshcd_abort() calls UFS error handler
[3] UFS error handler calls lock_system_sleep() and 
pm_runtime_get_sync()

In above sequence, either lock_system_sleep() or pm_runtime_get_sync()
shall
be blocked - [3] is blocked by [1], [2] is blocked by [3], while [1] 
is
blocked by [2].

For PM requests, I chose to abort them fast to unblock suspend/resume,
suspend/resume shall fail of course, but UFS error handler recovers
PM errors anyways.

In the above sequence, does [2] perhaps refer to aborting the SSU
command submitted in step [1] (this is not clear to me)?

Yes, your understanding is right.

If so, how about breaking the circular waiting cycle as follows:
- If it can happen that SSU succeeds after more than scsi_timeout
  seconds, define a custom timeout handler. From inside the timeout
  handler, schedule a link check and return BLK_EH_RESET_TIMER. If the
  link is no longer operational, run the error handler. If the link
  cannot be recovered by the error handler, fail all pending commands.
  This will prevent that ufshcd_abort() is called if a SSU command 
takes
  longer than expected. See also commit 0dd0dec1677e.
- Modify the UFS error handler such that it accepts a context argument.
  The context argument specifies whether or not the UFS error handler 
is
  called from inside a system suspend or system resume handler. If the
  UFS error handler is called from inside a system suspend or resume
  callback, skip the lock_system_sleep() and unlock_system_sleep()
  calls.

I am aware of commit 0dd0dec1677e, I gave my reviewed-by tag. Thank you
for your suggestion and I believe it can resolve the cycle, because 
actually
I've considered the similar way (leverage hba->host->eh_noresume) last 
year,
but I didn't take this way due to below reasons:

1. UFS error handler basically does one thing - reset and restore, which
stops hba [1], resets device [2] and re-probes the device [3]. Stopping 
hba [1]
shall complete any pending requests in the doorbell (with error or no 
error).
After [1], suspend/resume contexts, blocked by SSU cmd, shall be 
unblocked
right away to do whatever it needs to handle the SSU cmd failure 
(completed
in [1], so scsi_execute() returns an error), e.g., put link back to the 
old
state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and etc...
However, reset and restore ([2] and [3]) is still running, and it can 
(most likely)
be disturbed by suspend/resume. So passing a parameter or using 
hba->host->eh_noresume
to skip lock_system_sleep() and unlock_system_sleep() can break the 
cycle,
but error handling may run concurrently with suspend/resume. Of course 
we can
modify suspend/resume to avoid it, but I was pursuing a minimal change 
to get this fixed.

2. Whatever way we take to break the cycle, suspend/resume shall fail 
and
RPM framework shall save the error to dev.power.runtime_error, leaving
the device in runtime suspended or active mode permanently. If it is 
left
runtime suspended, UFS driver won't accept cmd anymore, while if it is 
left
runtime active, powers of UFS device and host will be left ON, leading 
to power
penalty. So my main idea is to let suspend/resume contexts, blocked by 
PM cmds,
fail fast first and then error handler recover everything back to work.

Thanks,

Can Guo.

Thanks,

Bart.