Re: [PATCH v3 3/8] scsi: core: Support failing requests while recovering

Bart Van Assche <bvanassche@xxxxxxx> · Mon, 3 Oct 2022 12:20:10 -0700

On 10/3/22 10:27, Mike Christie wrote:
On 9/29/22 5:00 PM, Bart Van Assche wrote:
The current behavior for SCSI commands submitted while error recovery
is ongoing is to retry command submission after error recovery has
finished. See also the scsi_host_in_recovery() check in
scsi_host_queue_ready(). Add support for failing SCSI commands while
host recovery is in progress. This functionality will be used to fix a
deadlock in the UFS driver.

Cc: Christoph Hellwig <hch@xxxxxx>
Cc: Ming Lei <ming.lei@xxxxxxxxxx>
Cc: John Garry <john.garry@xxxxxxxxxx>
Cc: Mike Christie <michael.christie@xxxxxxxxxx>
Cc: Hannes Reinecke <hare@xxxxxxx>
Signed-off-by: Bart Van Assche <bvanassche@xxxxxxx>
---
  drivers/scsi/scsi_lib.c | 8 +++++---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 473d9403f0c1..ecd2ce3815df 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1337,9 +1337,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
  				   struct scsi_device *sdev,
  				   struct scsi_cmnd *cmd)
  {
-	if (scsi_host_in_recovery(shost))
-		return 0;
-
  	if (atomic_read(&shost->host_blocked) > 0) {
  		if (scsi_host_busy(shost) > 0)
  			goto starved;
@@ -1727,6 +1724,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
  	ret = BLK_STS_RESOURCE;
  	if (!scsi_target_queue_ready(shost, sdev))
  		goto out_put_budget;
+	if (unlikely(scsi_host_in_recovery(shost))) {
+		if (req->cmd_flags & REQ_FAILFAST_MASK)
+			ret = BLK_STS_OFFLINE;
+		goto out_dec_target_busy;
+	}
  	if (!scsi_host_queue_ready(q, shost, sdev, cmd))
  		goto out_dec_target_busy;
  

This might add a regression to dm-multipath or it at least makes
the behavior difficult to understand for users and support people.

If there is a transport issue and the cmd times out and the abort
does as well or fails, then we would start the scsi eh recovery. When the
driver/transport class figures out it's a transport issue we will block
the scsi_device. So before this patch we would requeue the cmd then we
would wait until the fast_io_fail (FC) or replacement_timer (iscsi) to
fire before failing commands upwards and failing paths.

With this patch we can end up doing 2 or 3 things depending on timing:
1. If the cmd hits the code above we will fail the command and cause a
path failure.

2. If driver/transport blocks the scsi_device first then we would hit the
scsi_device state check in scsi_queue_rq and not fail the path like before.

3. With or without your patch, if dm_mq_queue_rq calls the busy callback
first (this does blk_lld_busy -> scsi_mq_lld_busy -> scsi_host_in_recovery)
then it might see all the paths in recovery. It considers this a temp condition
and will requeue cmds. So in this case we will not fail the path.

I'm not sure what the correct fix is. Maybe not fast fail REQ_FAILFAST_TRANSPORT
above, or maybe add a new FAILFAST type that you could use?

Hi Mike,

I will look into introducing a new request flag for this purpose.

Thanks,

Bart.