Re: [PATCH] scsi: core: always finish successfully aborted notry command

Mike Christie <michael.christie@xxxxxxxxxx> · Wed, 6 Apr 2022 11:08:56 -0500

On 4/6/22 1:56 AM, Wenchao Hao wrote:
> On 2022/4/4 0:58, Mike Christie wrote:
>> On 3/31/22 10:50 PM, Wenchao Hao wrote:
>>> If the abort command succeed and it does not need to be retired, do not add
>>> it to error handle list. 
>>>
>>> Adding command to error handle list is an annoying flow which would stop I/O
>>> of all LUNs which shared a same HBA.
>>>
>>> So here if we successfully abort a command, we can finish it via
>>> scsi_finish_command() which would reduce time spent on scsi_error_handler()
>>>
>>> Signed-off-by: Wenchao Hao <haowenchao@xxxxxxxxxx>
>>> ---
>>>  drivers/scsi/scsi_error.c | 55 +++++++++++++++++++++------------------
>>>  1 file changed, 29 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>> index cdaca13ac1f1..15299603b7ee 100644
>>> --- a/drivers/scsi/scsi_error.c
>>> +++ b/drivers/scsi/scsi_error.c
>>> @@ -173,41 +173,44 @@ scmd_eh_abort_handler(struct work_struct *work)
>>>  		goto out;
>>>  	}
>>>  	set_host_byte(scmd, DID_TIME_OUT);
>>> -	if (scsi_host_eh_past_deadline(shost)) {
>>> -		SCSI_LOG_ERROR_RECOVERY(3,
>>> -			scmd_printk(KERN_INFO, scmd,
>>> -				    "eh timeout, not retrying "
>>> -				    "aborted command\n"));
>>> -		goto out;
>>> -	}
>>>  
>>> -	spin_lock_irqsave(shost->host_lock, flags);
>>> -	list_del_init(&scmd->eh_entry);
>>> +	if (scsi_noretry_cmd(scmd) ||
>>> +	    !scsi_cmd_retry_allowed(scmd) &&
>>> +	    !scsi_eh_should_retry_cmd(scmd)) {
>>
>>
>> I don't think this test is correct. Did you want all ||s?
> 
> Sorry, I made a mistake, it should be "||" here.
> 
>>
>> For what the patch is trying to accomplish I'm not sure if it's
>> the behavior people wanted. Do all drivers have configurable
>> abort timeouts? If an abort takes N minutes to complete ok,
>> and that's put us over the eh deadline maybe the user wanted
>> that device to be offlined.
>> .
>>
> 
> I think the changes would not affect the timeouts and deadline
> because the logic changed is after command abort successfully.
> 
> We check if the command need to be retired after aborting command
> successfully, if it does not need to retry, just finish it (only
> commands aborted failed should be add to error handle list).

I'm saying an abort can complete but the device could be flakey.
The user could have set the deadline for 5 minutes. If the abort
took 5 minutes and 1 second then the user might want that device
set to offline, because retrying that device could take another 5
minutes. Instead of waiting for N retries they might have just
wanted all to that device failed.

> 
> I submit this change because we want to reduce the time spent on 
> scsi_error_handler().

I get that and I agree users do not want to wait for the entire
host to be blocked. I'm just saying that it might change behavior
that users wanted where the device will be offlined. I'm not 100% sure
though, so it's better for maybe Hannes and others to chime in.

Fixing up the EH to work in a way that does not require the entire
host to be blocked like in your other post is best. It would
give you want you want, and preserve the existing behavior and also
help other users.