Re: [PATCH v3 3/8] scsi: core: Support failing requests while recovering

Mike Christie <michael.christie@xxxxxxxxxx> · Mon, 3 Oct 2022 12:27:06 -0500

On 9/29/22 5:00 PM, Bart Van Assche wrote:
> The current behavior for SCSI commands submitted while error recovery
> is ongoing is to retry command submission after error recovery has
> finished. See also the scsi_host_in_recovery() check in
> scsi_host_queue_ready(). Add support for failing SCSI commands while
> host recovery is in progress. This functionality will be used to fix a
> deadlock in the UFS driver.
> 
> Cc: Christoph Hellwig <hch@xxxxxx>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> Cc: John Garry <john.garry@xxxxxxxxxx>
> Cc: Mike Christie <michael.christie@xxxxxxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxx>
> Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
> ---
>  drivers/scsi/scsi_lib.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 473d9403f0c1..ecd2ce3815df 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1337,9 +1337,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>  				   struct scsi_device *sdev,
>  				   struct scsi_cmnd *cmd)
>  {
> -	if (scsi_host_in_recovery(shost))
> -		return 0;
> -
>  	if (atomic_read(&shost->host_blocked) > 0) {
>  		if (scsi_host_busy(shost) > 0)
>  			goto starved;
> @@ -1727,6 +1724,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	ret = BLK_STS_RESOURCE;
>  	if (!scsi_target_queue_ready(shost, sdev))
>  		goto out_put_budget;
> +	if (unlikely(scsi_host_in_recovery(shost))) {
> +		if (req->cmd_flags & REQ_FAILFAST_MASK)
> +			ret = BLK_STS_OFFLINE;
> +		goto out_dec_target_busy;
> +	}
>  	if (!scsi_host_queue_ready(q, shost, sdev, cmd))
>  		goto out_dec_target_busy;
>  

This might add a regression to dm-multipath or it at least makes
the behavior difficult to understand for users and support people.

If there is a transport issue and the cmd times out and the abort
does as well or fails, then we would start the scsi eh recovery. When the
driver/transport class figures out it's a transport issue we will block
the scsi_device. So before this patch we would requeue the cmd then we
would wait until the fast_io_fail (FC) or replacement_timer (iscsi) to
fire before failing commands upwards and failing paths.

With this patch we can end up doing 2 or 3 things depending on timing:
1. If the cmd hits the code above we will fail the command and cause a
path failure.

2. If driver/transport blocks the scsi_device first then we would hit the
scsi_device state check in scsi_queue_rq and not fail the path like before.

3. With or without your patch, if dm_mq_queue_rq calls the busy callback
first (this does blk_lld_busy -> scsi_mq_lld_busy -> scsi_host_in_recovery)
then it might see all the paths in recovery. It considers this a temp condition
and will requeue cmds. So in this case we will not fail the path.

I'm not sure what the correct fix is. Maybe not fast fail REQ_FAILFAST_TRANSPORT
above, or maybe add a new FAILFAST type that you could use?