Re: [PATCH v2 4/4] scsi: ufs: Fix up and simplify error recovery mechanism

Can Guo <cang@xxxxxxxxxxxxxx> · Tue, 14 Jul 2020 12:26:06 +0800

Hi Bart,

On 2020-07-14 11:52, Bart Van Assche wrote:
On 2020-07-13 19:28, Can Guo wrote:
o Queue eh_work on a single threaded workqueue to avoid concurrency between
  eh_works.

Please use another approach (mutex?) to serialize error handling. There are
already way too many workqueues in a running Linux system.
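
For illustration only, here is roughly what the mutex suggestion above could
look like, modelled in userspace C with pthreads rather than the kernel mutex
API. The names eh_model, eh_lock and eh_model_run are hypothetical and do not
come from the patch; the real handler would do reset and restore where the
comment is.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model, not driver code: all names are illustrative. */
struct eh_model {
	pthread_mutex_t eh_lock;	/* serializes error handler runs */
	int in_handler;			/* handlers currently in the critical section */
	int max_in_handler;		/* highest value in_handler ever reached */
	int eh_runs;			/* completed handler runs */
};

void eh_model_init(struct eh_model *m)
{
	pthread_mutex_init(&m->eh_lock, NULL);
	m->in_handler = 0;
	m->max_in_handler = 0;
	m->eh_runs = 0;
}

/* One error handler run; callers may invoke this from any thread. */
void eh_model_run(struct eh_model *m)
{
	pthread_mutex_lock(&m->eh_lock);
	m->in_handler++;
	if (m->in_handler > m->max_in_handler)
		m->max_in_handler = m->in_handler;

	/* Assumed placeholder: reset and restore would happen here. */
	m->eh_runs++;

	m->in_handler--;
	pthread_mutex_unlock(&m->eh_lock);
}
```

Because every run takes eh_lock first, max_in_handler can never exceed 1,
which is the same guarantee the single threaded workqueue was meant to give.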

o According to the UFSHCI JEDEC spec, hibern8 enter/exit error occurs when
  the link is broken. This actually applies to any power mode change
  operation. In this change, if a power mode change operation (including
  AH8 enter/exit) fails, mark the link state as UIC_LINK_BROKEN_STATE and
  schedule eh_work. eh_work needs to do a full reset and restore to recover
  the link back to active. Before the link state is recovered to active by
  eh_work, any power mode change attempt just returns -ENOLINK to avoid
  consecutive HW errors.

o To avoid concurrency between eh_work and link recovery, remove link
  recovery from hibern8 enter/exit func. If hibern8 enter/exit func fails,
  simply return error code and let eh_work run in parallel.

o Recover UFS hba runtime PM error in eh_work. If ufshcd_suspend/resume
  fails due to UFS error, e.g. hibern8 enter/exit error and SSU cmd error,
  the runtime PM framework saves the error to dev.power.runtime_error.
  After that, hba runtime suspend/resume would not be invoked anymore until
  dev.power.runtime_error is cleared. The runtime PM error can be recovered
  in eh_work by calling pm_runtime_set_active() after reset and restore
  succeeds. Meanwhile, if pm_runtime_set_active() returns no error, which
  means dev.power.runtime_error is cleared, we also need to explicitly
  resume those scsi devices under hba in case any of them has failed to be
  resumed due to hba runtime resume error.

o Fix a racing problem between eh_work and ufshcd_suspend/resume. The old
  code blocks scsi requests before scheduling eh_work, but when eh_work
  calls pm_runtime_get_sync(), if ufshcd_suspend/resume is sending a scsi
  cmd, most likely the SSU cmd, pm_runtime_get_sync() will never return
  because scsi requests were blocked. To fix this racing problem,
  o Don't block scsi requests before scheduling eh_work, but let eh_work
    block scsi requests when eh_work is ready to start error recovery.
  o Meanwhile, if eh_work is scheduled due to fatal error, don't requeue
    the scsi cmds sent from ufshcd_suspend/resume path, but simply let the
    scsi cmds fail. If the scsi cmds fail, hba runtime suspend/resume fails
    too, but it does not hurt since eh_work recovers hba runtime PM error.

o Move host/regs dump in ufshcd_check_errors() to eh_work because heavy
  dump in IRQ context can lead to stability issues. In addition, do some
  cleanup in ufshcd_print_host_regs() and ufshcd_print_host_state().
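
To make the link state handling in the second item above concrete, here is a
minimal userspace model, not driver code. The enum values, struct hba_model
and both function names are hypothetical, the hw_ok flag merely simulates
whether the hardware completes the operation, and returning -EIO for the
first failure is an assumption; only the -ENOLINK behaviour while the link is
broken is taken from the description.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical link states, loosely mirroring the description above. */
enum link_state { LINK_OFF, LINK_ACTIVE, LINK_HIBERN8, LINK_BROKEN };

struct hba_model {
	enum link_state link;
	bool eh_scheduled;	/* true once error handling has been requested */
};

/*
 * Model of a power mode change (including hibern8 enter/exit).
 * hw_ok simulates the hardware outcome of the operation.
 */
int model_pwr_mode_change(struct hba_model *hba, enum link_state target,
			  bool hw_ok)
{
	/* While the link is broken, refuse further attempts. */
	if (hba->link == LINK_BROKEN)
		return -ENOLINK;

	if (!hw_ok) {
		/* Failure: mark the link broken and schedule the handler. */
		hba->link = LINK_BROKEN;
		hba->eh_scheduled = true;
		return -EIO;	/* assumed error code for the failed attempt */
	}

	hba->link = target;
	return 0;
}

/* Model of eh_work: a full reset and restore, assumed to succeed. */
void model_eh_work(struct hba_model *hba)
{
	hba->link = LINK_ACTIVE;
	hba->eh_scheduled = false;
}
```

In this model a failed hibern8 enter leaves the link in LINK_BROKEN, every
later attempt returns -ENOLINK without touching the hardware, and attempts
succeed again only after model_eh_work() has run.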

The above list is a long list. To me that is a sign that this patch needs to
be split into multiple patches.

Thanks,

Bart.

Sure, will split it into a few patches.

Thanks,

Can Guo.