Re: [PATCH v1 2/3] scsi: ufs: Optimize host lock on transfer requests send/compl paths

Can Guo <cang@xxxxxxxxxxxxxx> · Wed, 23 Jun 2021 10:04:39 +0800




Hi Bart,

On 2021-06-17 10:49, Bart Van Assche wrote:
On 5/24/21 1:36 AM, Can Guo wrote:
@@ -2688,6 +2705,43 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
+	case UFSHCD_STATE_EH_SCHEDULED_FATAL:
+		/*
+		 * pm_runtime_get_sync() is used at error handling preparation
+		 * stage. If a scsi cmd, e.g. the SSU cmd, is sent from hba's
+		 * PM ops, it can never be finished if we let SCSI layer keep
+		 * retrying it, which gets err handler stuck forever. Neither
+		 * can we let the scsi cmd pass through, because UFS is in bad
+		 * state, the scsi cmd may eventually time out, which will get
+		 * err handler blocked for too long. So, just fail the scsi cmd
+		 * sent from PM ops, err handler can recover PM error anyways.
+		 */
+		if (hba->pm_op_in_progress) {
+			hba->force_reset = true;
+			set_host_byte(cmd, DID_BAD_TARGET);
+			cmd->scsi_done(cmd);
+			goto out;
+		}
+		fallthrough;

Hi Can,

I know that this patch only moves the above code and that the above code
has not been introduced by this patch. Anyway, is my understanding
correct that ufshcd_err_handler() can change the host controller state
from UFSHCD_STATE_EH_SCHEDULED_FATAL into UFSHCD_STATE_RESET and next
into UFSHCD_STATE_OPERATIONAL? If so, if the above code completes a READ
with status DID_BAD_TARGET and if recovery by the error handler
succeeds, will that cause the filesystem above the UFS driver to change
into read-only mode? If the above code completes a WRITE with status
DID_BAD_TARGET, will that cause data corruption? Is there any other
solution to prevent data corruption than merging the
UFSHCD_STATE_EH_SCHEDULED_FATAL and UFSHCD_STATE_EH_SCHEDULED_NON_FATAL
back into a single state and changing the ufshcd_rpm_get_sync(hba) call
in ufshcd_err_handling_prepare() into a pm_runtime_get_noresume() call?


Here, when hba->pm_op_in_progress is true, there cannot be any READ or WRITE
commands, since the hba is resuming or suspending. When a fatal error happens,
the DID_BAD_TARGET above is intended to let the SSU cmd (or whatever PM request
is blocking suspend/resume) fail fast. Neither returning HOST_BUSY nor letting
the cmd pass through can achieve that. This way, error handling prepare won't
get stuck [1] when it calls

lock_system_sleep()
pm_runtime_get_sync()
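
To make the fail-fast argument concrete, here is a toy user-space model, not
driver code. The names (enum policy, submit_pm_cmd, max_tries) are made up for
illustration. It assumes the host stays in the fatal state for as long as the
PM cmd is outstanding, because the error handler that would clear the state is
itself waiting for that PM op to finish.

```c
/* Hypothetical policies for a PM-issued cmd while a fatal error is pending. */
enum policy {
	POLICY_HOST_BUSY,	/* ask the SCSI layer to retry later */
	POLICY_FAIL_FAST,	/* complete the cmd at once with an error */
};

/*
 * Model the SCSI layer resubmitting a PM cmd (e.g. SSU). Returns the number
 * of submissions needed until the cmd completes, or -1 if it is still
 * pending after max_tries submissions.
 */
static int submit_pm_cmd(enum policy policy, int max_tries)
{
	for (int tries = 1; tries <= max_tries; tries++) {
		if (policy == POLICY_FAIL_FAST)
			return tries;	/* completed (with an error) */
		/* POLICY_HOST_BUSY: cmd is requeued, state never changes */
	}
	return -1;
}
```

With POLICY_FAIL_FAST the cmd completes on the first submission, so the PM op
can return and the error handler can proceed. With POLICY_HOST_BUSY the cmd
never completes in this model, no matter how large max_tries is.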

The reason why I split UFSHCD_STATE_EH_SCHEDULED into
UFSHCD_STATE_EH_SCHEDULED_FATAL and UFSHCD_STATE_EH_SCHEDULED_NON_FATAL is that

1. For non-fatal errors, the HW can recover by itself, so when the host state
is UFSHCD_STATE_EH_SCHEDULED_NON_FATAL, cmds can still pass through.

2. When a non-fatal error (LINE-RESET for example) happens, the error handler
only needs to do a power mode transition without a full reset. If we only had
one state, returning HOST_BUSY would get error handling prepare stuck [1],
while fast failing SSU cmds would make the error handler do a full reset
(which goes too far for non-fatal errors).
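
The two points above can be summarized in a simplified user-space sketch of
the state check. This is not the actual ufshcd_queuecommand() code: the enum
and struct names are hypothetical stand-ins, and it assumes that the
fallthrough from the fatal case lands on a case that returns HOST_BUSY.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the UFSHCD_STATE_* values discussed here. */
enum host_state {
	STATE_OPERATIONAL,
	STATE_EH_SCHEDULED_NON_FATAL,
	STATE_EH_SCHEDULED_FATAL,
	STATE_RESET,
};

/* Hypothetical outcomes of the state check for an incoming cmd. */
enum dispatch {
	DISPATCH_PASS_THROUGH,	/* send the cmd to the hardware */
	DISPATCH_FAIL_FAST,	/* complete at once with DID_BAD_TARGET */
	DISPATCH_HOST_BUSY,	/* let the SCSI layer retry later */
};

struct host_model {
	enum host_state state;
	bool pm_op_in_progress;
	bool force_reset;
};

static enum dispatch check_state(struct host_model *hba)
{
	switch (hba->state) {
	case STATE_OPERATIONAL:
	case STATE_EH_SCHEDULED_NON_FATAL:
		/* HW can recover by itself, cmds may still be sent. */
		return DISPATCH_PASS_THROUGH;
	case STATE_EH_SCHEDULED_FATAL:
		if (hba->pm_op_in_progress) {
			/* Fail the PM cmd fast and request a full reset. */
			hba->force_reset = true;
			return DISPATCH_FAIL_FAST;
		}
		/* fall through */
	case STATE_RESET:
		return DISPATCH_HOST_BUSY;
	}
	/* Not reached for valid states; be conservative otherwise. */
	return DISPATCH_HOST_BUSY;
}
```

Only the fatal state combined with a PM op in progress sets force_reset, so a
non-fatal error never escalates to a full reset in this model.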

Thanks,

Can Guo.

Thanks,

Bart.