On 6/13/21 7:42 AM, Can Guo wrote:
> 2. ufshcd_abort() invokes ufshcd_err_handler() synchronously can have a
> live lock issue, which is why I chose the asynchronous way (from the first
> day I started to fix error handling). The live lock happens when abort
> happens to a PM request, e.g., a SSU cmd sent from suspend/resume. Because
> UFS error handler is synchronized with suspend/resume (by calling
> pm_runtime_get_sync() and lock_system_sleep()), the sequence is like:
> [1] ufshcd_wl_resume() sends SSU cmd
> [2] ufshcd_abort() calls UFS error handler
> [3] UFS error handler calls lock_system_sleep() and pm_runtime_get_sync()
>
> In above sequence, either lock_system_sleep() or pm_runtime_get_sync()
> shall be blocked - [3] is blocked by [1], [2] is blocked by [3], while [1]
> is blocked by [2].
>
> For PM requests, I chose to abort them fast to unblock suspend/resume,
> suspend/resume shall fail of course, but UFS error handler recovers
> PM errors anyways.

In the above sequence, does [2] perhaps refer to aborting the SSU command
submitted in step [1] (this is not clear to me)? If so, how about breaking
the circular waiting cycle as follows:
- If it can happen that SSU succeeds after more than scsi_timeout seconds,
  define a custom timeout handler. From inside the timeout handler,
  schedule a link check and return BLK_EH_RESET_TIMER. If the link is no
  longer operational, run the error handler. If the link cannot be
  recovered by the error handler, fail all pending commands. This will
  prevent ufshcd_abort() from being called if an SSU command takes longer
  than expected. See also commit 0dd0dec1677e.
- Modify the UFS error handler such that it accepts a context argument.
  The context argument specifies whether or not the UFS error handler is
  called from inside a system suspend or system resume handler. If the
  UFS error handler is called from inside a system suspend or resume
  callback, skip the lock_system_sleep() and unlock_system_sleep() calls.

Thanks,

Bart.
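
The second suggestion above (an error handler that takes a context argument) could look roughly like the sketch below. It is an untested userspace model and not kernel code. Every name in it (eh_context, model_err_handler(), model_lock_system_sleep() and so on) is hypothetical, and the model assumes that the system sleep lock is already held by the suspend/resume path when a system PM callback runs.

```c
/* Userspace model only; not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

/* Identifies who invoked the error handler. */
enum eh_context {
	EH_CTX_NORMAL,     /* invoked outside of system suspend/resume */
	EH_CTX_SYSTEM_PM,  /* invoked from a system suspend/resume callback */
};

/* Stand-in for the system sleep lock. */
static bool sleep_lock_held;
static int sleep_lock_acquisitions;

static void model_lock_system_sleep(void)
{
	/* A real lock would block here forever; the model asserts instead. */
	assert(!sleep_lock_held);
	sleep_lock_held = true;
	sleep_lock_acquisitions++;
}

static void model_unlock_system_sleep(void)
{
	assert(sleep_lock_held);
	sleep_lock_held = false;
}

/* Stand-in for the actual recovery work; always succeeds in this model. */
static int model_recover_link(void)
{
	return 0;
}

/* Error handler that skips the sleep lock when called from system PM. */
int model_err_handler(enum eh_context ctx)
{
	bool take_lock = (ctx != EH_CTX_SYSTEM_PM);
	int ret;

	if (take_lock)
		model_lock_system_sleep();
	ret = model_recover_link();
	if (take_lock)
		model_unlock_system_sleep();
	return ret;
}

/* Models a system resume callback that runs with the sleep lock held. */
int model_system_resume(void)
{
	int ret;

	model_lock_system_sleep();  /* taken by the suspend/resume path */
	ret = model_err_handler(EH_CTX_SYSTEM_PM);
	model_unlock_system_sleep();
	return ret;
}
```

With EH_CTX_SYSTEM_PM the handler never tries to take the sleep lock a second time, which removes the "[3] is blocked by [1]" edge from the cycle. The runtime PM side (pm_runtime_get_sync()) is not modeled here.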