On 2020-07-13 19:28, Can Guo wrote:
> o Queue eh_work on a single threaded workqueue to avoid concurrency between
>   eh_works.

Please use another approach (mutex?) to serialize error handling. There are
already way too many workqueues in a running Linux system.

> o According to the UFSHCI JEDEC spec, hibern8 enter/exit error occurs when
>   the link is broken. This actually applies to any power mode change
>   operations. In this change, if a power mode change operation (including
>   AH8 enter/exit) fails, mark the link state as UIC_LINK_BROKEN_STATE and
>   schedule eh_work. eh_work needs to do full reset and restore to recover
>   the link back to active. Before the link state is recovered to active by
>   eh_work, any power mode change attempts just return -ENOLINK to avoid
>   consecutive HW error.
>
> o To avoid concurrency between eh_work and link recovery, remove link
>   recovery from hibern8 enter/exit func. If hibern8 enter/exit func fails,
>   simply return error code and let eh_work run in parallel.
>
> o Recover UFS hba runtime PM error in eh_work. If ufshcd_suspend/resume
>   fails due to UFS error, e.g. hibern8 enter/exit error and SSU cmd error,
>   the runtime PM framework saves the error to dev.power.runtime_error.
>   After that, hba runtime suspend/resume would not be invoked anymore until
>   dev.power.runtime_error is cleared. The runtime PM error can be recovered
>   in eh_work by calling pm_runtime_set_active() after reset and restore
>   succeeds. Meanwhile, if pm_runtime_set_active() returns no error, which
>   means dev.power.runtime_error is cleared, we also need to explicitly
>   resume those scsi devices under hba in case any of them has failed to be
>   resumed due to hba runtime resume error.
>
> o Fix a racing problem between eh_work and ufshcd_suspend/resume. In the
>   old code, it blocks scsi requests before scheduling eh_work, but when
>   eh_work calls pm_runtime_get_sync(), if ufshcd_suspend/resume is sending
>   a scsi cmd, most likely the SSU cmd, pm_runtime_get_sync() will never
>   return because scsi requests were blocked. To fix this racing problem,
>   o Don't block scsi requests before scheduling eh_work, but let eh_work
>     block scsi requests when eh_work is ready to start error recovery.
>   o Meanwhile, if eh_work is scheduled due to fatal error, don't requeue
>     the scsi cmds sent from ufshcd_suspend/resume path, but simply let the
>     scsi cmds fail. If the scsi cmds fail, hba runtime suspend/resume fails
>     too, but it does not hurt since eh_work recovers hba runtime PM error.
>
> o Move host/regs dump in ufshcd_check_errors() to eh_work because heavy
>   dump in IRQ context can lead to stability issues. In addition, some clean
>   up in ufshcd_print_host_regs() and ufshcd_print_host_state().

The above is a long list. To me that is a sign that this patch needs to be
split into multiple patches.

Thanks,

Bart.
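
As an illustration of the mutex suggestion above, here is a minimal user-space sketch using pthreads rather than kernel code. Every name in it (demo_hba, demo_err_handler, demo_run, DEMO_ROUNDS) is hypothetical and not part of the ufshcd driver; it only shows that taking a lock at the top of the error handler keeps handlers from overlapping without a dedicated single-threaded workqueue. In the kernel the analogue would presumably be a struct mutex with mutex_lock()/mutex_unlock(), but that is an assumption, not what the patch does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEMO_MAX_THREADS 16
#define DEMO_ROUNDS      1000

/* Hypothetical stand-in for the host state; NOT the real struct ufs_hba. */
struct demo_hba {
    pthread_mutex_t eh_lock;  /* serializes error handling */
    bool eh_in_progress;      /* used only to check the invariant */
    int recoveries;           /* number of completed recoveries */
};

void demo_hba_init(struct demo_hba *hba)
{
    pthread_mutex_init(&hba->eh_lock, NULL);
    hba->eh_in_progress = false;
    hba->recoveries = 0;
}

/* Error handler body; may be entered from any number of threads. */
void demo_err_handler(struct demo_hba *hba)
{
    pthread_mutex_lock(&hba->eh_lock);

    /* With the lock held, no other handler can be mid-recovery. */
    assert(!hba->eh_in_progress);
    hba->eh_in_progress = true;

    /* A real handler would do the reset and restore here. */
    hba->recoveries++;

    hba->eh_in_progress = false;
    pthread_mutex_unlock(&hba->eh_lock);
}

/* Each worker triggers the error handler DEMO_ROUNDS times. */
void *demo_worker(void *arg)
{
    struct demo_hba *hba = arg;

    for (int i = 0; i < DEMO_ROUNDS; i++)
        demo_err_handler(hba);
    return NULL;
}

/* Run up to DEMO_MAX_THREADS workers concurrently; return total recoveries. */
int demo_run(int nthreads)
{
    struct demo_hba hba;
    pthread_t tid[DEMO_MAX_THREADS];
    int started = 0;

    if (nthreads > DEMO_MAX_THREADS)
        nthreads = DEMO_MAX_THREADS;

    demo_hba_init(&hba);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[started], NULL, demo_worker, &hba) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&hba.eh_lock);
    return hba.recoveries;
}
```

Calling demo_run(4) starts four threads that all invoke the handler concurrently; because each invocation holds eh_lock for its whole body, the counter ends up at exactly 4 * DEMO_ROUNDS and the in-progress assertion never fires.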