Re: [PATCH 3/4] scsi: Internally retry scsi_execute commands

Martin Wilck <mwilck@xxxxxxxx> · Wed, 10 Aug 2022 12:46:17 +0200

Hi Mike,

On Tue, 2022-08-09 at 22:41 -0500, Mike Christie wrote:
> In several places like LU setup and pr_ops we will hit the noretry
> code
> path because we do not retry any passthrough commands that hit device
> errors even though scsi-ml thinks the command is retryable.
> 
> This has us only fast fail commands that hit device errors that have
> been
> submitted through the block layer directly and not via scsi_execute.
> This
> allows SG IO and other users to continue to get fast failures and all
> device errors returned to them, and scsi_execute users will get their
> retries they had requested.
> 
> Signed-off-by: Mike Christie <michael.christie@xxxxxxxxxx>

Thanks a lot. I like the general approach, but I am concerned that by
treating every command sent by scsi_execute() or scsi_execute_req() as
a retryable command, we may break some callers, or at least modify
their semantics in unexpected ways. For example, spi_execute(),
scsi_probe_lun(), scsi_report_lun_scan() currently retry only on UA.
With this change, these commands will be retried in additional cases,
without the caller noticing (see 949bf797595fc ("[SCSI] fix command
retries in spi_transport class"). It isn't obvious to me that this is
correct in all affected cases. 

Note that the retry logic of the mid level may change depending on the
installed device handler. For example, ALUA devices will endlessly
retry UA with ASC=29, which some callers explicitly look for.

I believe we need to review all callers that have their own retry loop
and /or error handling logic. This includes sd_spinup_disk(),
sd_sync_cache(), scsi_test_unit_ready(). SCSI device handlers treat
some sense keys in special ways, or may retry commands on another
member of a port group (see alua_rtpg()).

DID_TIME_OUT is a general concern - no current caller of scsi_execute()
expects timed-out commands to be retried, and doing so has the
potential of slowing down operations a lot. I am aware that my recent
patch changed exactly this for scsi_probe_lun(), but doing the same
thing for every scsi_execute() invocation sounds at least somewhat
dangerous.

Bottom line: I wonder if SUBMITTED_BY_SCSI_EXEC is suffient to
determine that a command should be retried like an ordinary block level
command.

Regards,
Martin

> ---
>  drivers/scsi/scsi_error.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index ac4471e33a9c..573d926220c4 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1806,7 +1806,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> -       if (req->cmd_flags & REQ_FAILFAST_DEV ||
> blk_rq_is_passthrough(req))
> +       if (req->cmd_flags & REQ_FAILFAST_DEV ||
> +           scmd->submitter == SUBMITTED_BY_BLOCK_PT)
>                 return true;
>  
>         return false;