Re: [PATCH] scsi: core: Kick the requeue list after inserting when flushing

Damien Le Moal <dlemoal@xxxxxxxxxx> · Fri, 12 Jan 2024 08:41:57 +0900

On 1/11/24 21:05, Niklas Cassel wrote:
> When libata calls ata_link_abort() to abort all ata queued commands,
> it calls blk_abort_request() on the SCSI command representing each QC.
> 
> This causes scsi_timeout() to be called, which calls scsi_eh_scmd_add()
> for each SCSI command.
> 
> scsi_eh_scmd_add() sets the SCSI host to state recovery, and then adds
> the command to shost->eh_cmd_q.
> 
> This will wake up the SCSI EH, and eventually the libata EH strategy
> handler will be called, which calls scsi_eh_flush_done_q() to either
> flush retry or flush finish each failed command.
> 
> The commands that are flush retried by scsi_eh_flush_done_q() are
> done so using scsi_queue_insert().
> 
> Before commit 8b566edbdbfb ("scsi: core: Only kick the requeue list if
> necessary"), __scsi_queue_insert() called blk_mq_requeue_request() with
> the second argument set to true, indicating that it should always
> kick/run the requeue list after inserting.
> 
> After commit 8b566edbdbfb ("scsi: core: Only kick the requeue list if
> necessary"), __scsi_queue_insert() does not kick/run the requeue list
> after inserting, if the current SCSI host state is recovery (which is
> the case in the libata example above).
> 
> This optimization is probably fine in most cases, as I can only assume
> that most often someone will eventually kick/run the queues.
> 
> However, that is not the case for scsi_eh_flush_done_q(), where we can
> see that the request gets inserted to the requeue list, but the queue is
> never started after the request has been inserted, leading to the block
> layer waiting for the completion of command that never gets to run.
> 
> Since scsi_eh_flush_done_q() is called by SCSI EH context, the SCSI host
> state is most likely always in recovery when this function is called.
> 
> Thus, let scsi_eh_flush_done_q() explicitly kick the requeue list after
> inserting a flush retry command, so that scsi_eh_flush_done_q() keeps
> the same behavior as before commit 8b566edbdbfb ("scsi: core: Only kick
> the requeue list if necessary").
> 
> Simple reproducer for the libata example above:
> $ hdparm -Y /dev/sda
> $ echo 1 > /sys/class/scsi_device/0\:0\:0\:0/device/delete
> 
> Fixes: 8b566edbdbfb ("scsi: core: Only kick the requeue list if necessary")
> Reported-by: Kevin Locke <kevin@xxxxxxxxxxxxxxx>
> Closes: https://lore.kernel.org/linux-scsi/ZZw3Th70wUUvCiCY@xxxxxxxxxxxxxxx/
> Signed-off-by: Niklas Cassel <cassel@xxxxxxxxxx>
> ---
>  drivers/scsi/scsi_error.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 1223d34c04da..d983f4a0e9f1 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -2196,15 +2196,18 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
>  	struct scsi_cmnd *scmd, *next;
>  
>  	list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
> +		struct scsi_device *sdev = scmd->device;
> +
>  		list_del_init(&scmd->eh_entry);
> -		if (scsi_device_online(scmd->device) &&
> -		    !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd) &&
> -			scsi_eh_should_retry_cmd(scmd)) {
> +		if (scsi_device_online(sdev) && !scsi_noretry_cmd(scmd) &&
> +		    scsi_cmd_retry_allowed(scmd) &&
> +		    scsi_eh_should_retry_cmd(scmd)) {
>  			SCSI_LOG_ERROR_RECOVERY(3,
>  				scmd_printk(KERN_INFO, scmd,
>  					     "%s: flush retry cmd\n",
>  					     current->comm));
>  				scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
> +				blk_mq_kick_requeue_list(sdev->request_queue);
>  		} else {
>  			/*
>  			 * If just we got sense for the device (called

Looks good to me.

Reviewed-by: Damien Le Moal <dlemoal@xxxxxxxxxx>

-- 
Damien Le Moal
Western Digital Research