On 2021-06-16 02:25, Bart Van Assche wrote:
On 6/14/21 7:36 PM, Can Guo wrote:
I've considered the similar way (leverage hba->host->eh_noresume) last
year,
but I didn't take this way due to below reasons:
1. UFS error handler basically does one thing - reset and restore,
which
stops hba [1], resets device [2] and re-probes the device [3].
Stopping
hba [1]
shall complete any pending requests in the doorbell (with error or no
error).
After [1], suspend/resume contexts, blocked by SSU cmd, shall be
unblocked
right away to do whatever it needs to handle the SSU cmd failure
(completed
in [1], so scsi_execute() returns an error), e.g., put link back to
the old
state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and
etc...
However, reset and restore ([2] and [3]) is still running, and it can
(most likely)
be disturbed by suspend/resume. So passing a parameter or using
hba->host->eh_noresume
to skip lock_system_sleep() and unlock_system_sleep() can break the
cycle,
but error handling may run concurrently with suspend/resume. Of course
we can
modify suspend/resume to avoid it, but I was pursuing a minimal change
to get this fixed.
2. Whatever way we take to break the cycle, suspend/resume shall fail
and
RPM framework shall save the error to dev.power.runtime_error, leaving
the device in runtime suspended or active mode permanently. If it is
left
runtime suspended, UFS driver won't accept cmd anymore, while if it is
left
runtime active, powers of UFS device and host will be left ON, leading
to power
penalty. So my main idea is to let suspend/resume contexts, blocked by
PM cmds,
fail fast first and then error handler recover everything back to
work.
Hi Can,
Has it been considered to make the UFS error handler fail pending
commands with an error code that causes the SCSI core to resubmit the
SCSI command, e.g. DID_IMM_RETRY or DID_TRANSPORT_DISRUPTED? I want to
prevent that power management or suspend/resume callbacks fail if the
error handler succeeds with recovering the UFS transport.
Hi Bart,
Thanks for the suggestion, I thought about it but I didn't go that
far in this path because I believe letting a context fast fail is
better than retrying/blocking it (to me suspend/resume can fail
due to many reasons and task abort is just one of them). I appreciate
the idea, but I would like to stick to my way as of now because
1. Merely preventing task abort cannot prevent suspend/resume fail.
Task abort (to PM requests), in real cases, is just one of many kinds
of failure which can fail the suspend/resume callbacks. During
suspend/resume, if AH8 error and/or UIC errors happen, IRQ handler
may complete SSU cmd with errors and schedule the error handler (I've
seen such scenarios in real customer cases). My idea is to treat task
abort (to PM requests) as a failure (let scsi_execute() return with
whatever error) and let error handler recover everything just like
any other UFS errors which invoke error handler. In case this, again,
goes back to the topic that is why don't just do error recovery in
suspend/resume, let me paste my previous reply here -
"
Error handler has the same nature of user access - it is unpredictable,
meaning it
can be invoked at any time (from IRQ handler), even when there is no
ongoing
cmd/data transactions (like auto hibern8 failure and UIC errors, such as
DME
error and some errors in data link layer) [1], unless you disable UFS
IRQ.
The reasons why I choose not to do it that way are (althrough error
handler
prepare has became much more simple after apply this change)
- I want to keep all the complexity within error handler, and re-direct
all error
recovery needs to error handler. It can avoid calling
ufshcd_reset_and_restore()
and/or flush_work(&hba->eh_work) here and there. The entire UFS
suspend/resume is
already complex enough, I don't want to mess up with it.
- We do explicit recovery only when we see certain errors, e.g., H8
enter func
returns an error during suspend, but as mentioned above [1], error
handling can
be invoked already from IRQ handler (due to all kinds of UIC errors
before H8 enter
func returns). So, we still need host_sem (in case of system
suspend/resume) to
avoid concurrency.
- During system suspend/resume, error handling can be invoked (due to
non-fatal
errors) but still UFS cmds return no error at all. Similar like above,
we need
host_sem to avoid concurrency.
"
2. And say we want SCSI layer to resubmit PM requests to prevent
suspend/resume fail, we should keep retrying the PM requests (so
long as error handler can recover everything successfully), meaning
we should give them unlimited retries (which I think is a bad idea),
otherwise (if they have zero retries or limited retries), in extreme
conditions, what may happen is that error handler can recover everything
successfully every time, but all these retries (say 3) still time out,
which block the power management for too long (retries * 60 seconds)
and,
most important, when the last retry times out, scsi layer will anyways
complete the PM request (even we return DID_IMM_RETRY), then we end up
same - suspend/resume shall run concurrently with error handler and we
couldn't recover saved PM errors.
Thanks,
Can Guo.
Thanks,
Bart.