Hi Bart,
On 2020-07-14 12:26, Can Guo wrote:
Hi Bart,
On 2020-07-14 11:52, Bart Van Assche wrote:
On 2020-07-13 19:28, Can Guo wrote:
o Queue eh_work on a single threaded workqueue to avoid concurrency between
eh_works.
Please use another approach (mutex?) to serialize error handling. There are
already way too many workqueues in a running Linux system.
Yeah, a mutex works, but in this change we need to flush the eh_work. As per
testing, in real cases flush_work() can trigger warnings if the work is
queued on system_wq. Please check the function check_flush_dependency().
o According to the UFSHCI JEDEC spec, a hibern8 enter/exit error occurs when
the link is broken. This actually applies to any power mode change
operation. In this change, if a power mode change operation (including
AH8 enter/exit) fails, mark the link state as UIC_LINK_BROKEN_STATE and
schedule eh_work. eh_work needs to do a full reset and restore to recover
the link back to active. Until eh_work has recovered the link state to
active, any power mode change attempt just returns -ENOLINK to avoid
consecutive HW errors.
o To avoid concurrency between eh_work and link recovery, remove link
recovery from hibern8 enter/exit func. If hibern8 enter/exit func fails,
simply return error code and let eh_work run in parallel.
o Recover UFS hba runtime PM error in eh_work. If ufshcd_suspend/resume
fails due to a UFS error, e.g. a hibern8 enter/exit error or an SSU cmd
error, the runtime PM framework saves the error to
dev.power.runtime_error. After that, hba runtime suspend/resume is not
invoked anymore until dev.power.runtime_error is cleared. The runtime PM
error can be recovered in eh_work by calling pm_runtime_set_active() after
reset and restore succeeds. Meanwhile, if pm_runtime_set_active() returns
no error, which means dev.power.runtime_error is cleared, we also need to
explicitly resume the scsi devices under hba in case any of them has
failed to be resumed due to the hba runtime resume error.
o Fix a race between eh_work and ufshcd_suspend/resume. The old code blocks
scsi requests before scheduling eh_work, but when eh_work calls
pm_runtime_get_sync(), if ufshcd_suspend/resume is sending a scsi cmd,
most likely the SSU cmd, pm_runtime_get_sync() will never return because
scsi requests were blocked. To fix this race,
  o Don't block scsi requests before scheduling eh_work, but let eh_work
  block scsi requests when eh_work is ready to start error recovery.
  o Meanwhile, if eh_work is scheduled due to a fatal error, don't requeue
  the scsi cmds sent from the ufshcd_suspend/resume path, but simply let
  the scsi cmds fail. If the scsi cmds fail, hba runtime suspend/resume
  fails too, but it does not hurt since eh_work recovers the hba runtime
  PM error.
o Move the host/regs dump in ufshcd_check_errors() to eh_work because a
heavy dump in IRQ context can lead to stability issues. In addition, do
some cleanup in ufshcd_print_host_regs() and ufshcd_print_host_state().
The above list is a long list. To me that is a sign that this patch needs to
be split into multiple patches.
Thanks,
Bart.
Sure, will split it into a few patches.
Thanks,
Can Guo.
I tried, but I find it hard to split because it works as a whole; it is a
refactoring change rather than a mixture of multiple fixes. I will try to
refine the commit message in the next version, so the patch stays as it is
for now.
Thanks,
Can Guo.
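
For readers following the thread, the link-state gating described in the
commit message (mark the link broken on a failed power mode change, schedule
eh_work, and return -ENOLINK until recovery completes) can be sketched as a
small userspace model. This is only an illustration and not the driver code:
struct hba_model, model_pwr_mode_change(), model_eh_work() and the
hw_fails_next_pmc knob are hypothetical names, and the -EIO returned for the
simulated hardware failure is an assumption. Only UIC_LINK_BROKEN_STATE and
-ENOLINK come from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the link states named in the thread. */
enum uic_link_state {
	UIC_LINK_ACTIVE_STATE,
	UIC_LINK_HIBERN8_STATE,
	UIC_LINK_BROKEN_STATE,
};

/* Hypothetical stand-in for the host controller state. */
struct hba_model {
	enum uic_link_state link_state;
	bool eh_scheduled;      /* models "eh_work has been queued" */
	bool hw_fails_next_pmc; /* test knob: next power mode change fails */
};

/*
 * Model of a power mode change (hibern8 enter/exit included).
 * While the link is broken, refuse immediately with -ENOLINK so that no
 * further commands are sent to broken hardware. On a (simulated) failure,
 * mark the link broken and schedule error handling instead of attempting
 * link recovery inline.
 */
int model_pwr_mode_change(struct hba_model *hba, enum uic_link_state target)
{
	if (hba->link_state == UIC_LINK_BROKEN_STATE)
		return -ENOLINK;

	if (hba->hw_fails_next_pmc) {
		hba->hw_fails_next_pmc = false;
		hba->link_state = UIC_LINK_BROKEN_STATE;
		hba->eh_scheduled = true;
		return -EIO; /* assumed error code for the failed operation */
	}

	hba->link_state = target;
	return 0;
}

/*
 * Model of eh_work: a full reset and restore brings the link back to
 * active, after which power mode changes are accepted again.
 */
void model_eh_work(struct hba_model *hba)
{
	if (!hba->eh_scheduled)
		return;

	hba->link_state = UIC_LINK_ACTIVE_STATE;
	hba->eh_scheduled = false;
}
```

In this model a failed request leaves the link in UIC_LINK_BROKEN_STATE, any
request issued before model_eh_work() runs gets -ENOLINK, and requests
succeed again once the handler has run.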