Re: [REQUEST DISCUSS]: speed up SCSI error handle for host with massive devices

Wenchao Hao <haowenchao@xxxxxxxxxx> · Wed, 6 Apr 2022 17:40:26 +0800

On 2022/4/4 1:14, Mike Christie wrote:
> On 3/30/22 5:59 AM, Wenchao Hao wrote:
>> On 2022/3/30 17:32, Hannes Reinecke wrote:
>>> On 3/30/22 11:11, Wenchao Hao wrote:
>>>> On 2022/3/30 2:56, Hannes Reinecke wrote:
>>>>> On 3/29/22 14:40, Wenchao Hao wrote:
>>>>>> On 2022/3/29 18:56, Steffen Maier wrote:
>>>>>>> On 3/29/22 11:06, Wenchao Hao wrote:
>>>>>>>> SCSI timeout would call scsi_eh_scmd_add() on some conditions, host would be set
>>>>>>>> to SHOST_RECOVERY state. Once host enter SHOST_RECOVERY, IOs submitted to all
>>>>>>>> devices in this host would not succeed until the scsi_error_handler() finished.
>>>>>>>> The scsi_error_handler() might takes long time to be done, it's unbearable when
>>>>>>>> host has massive devices.
>>>>>>>>
>>>>>>>> I want to ask is anyone applying another error handler flow to address this
>>>>>>>> phenomenon?
>>>>>>>>
>>>>>>>> I think we can move some operations(like scsi get sense, scsi send startunit
>>>>>>>> and scsi device reset) out of scsi_unjam_host(), to perform these operations
>>>>>>>> without setting host to SHOST_RECOVERY? It would reduce the time of block the
>>>>>>>> whole host.
>>>>>>>>
>>>>>>>> Waiting for your discussion.
>>>>>>>
>>>>>>> We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
>>>>>>>
>>>>>>
>>>>>> Yes, I mean when scsi_abort_command() failed and scsi_eh_scmd_add() is called.
>>>>>>
>>>>>>> There's eh_deadline for limiting the time spent in escalation of scsi_eh, and instead directly go to host reset. Would this help?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> The deadline seems not helpful. What we want to see is a single LUN's command error
>>>>>> would not stop other LUNs which share the same host. So my plan is to move reset LUN out
>>>>>> from scsi_unjam_host() which run with host set to SHOST_RECOVERY.
>>>>>
>>>>> Nope. One of the key points of scsi_unjam_host() is that is has to stop all I/O before proceeding. Without doing so basically all SCSI parallel HBAs will fail EH as they _require_ I/O to be stopped.
>>>>>
>>>>
>>>> I still can not understand why we must stop all I/O. In my comprehension, stopping all I/O
>>>> is because we might reset host during scsi_error_handler() and we must wait host's number of
>>>> failed command equal to number of busy command then we can wake up scsi_error_handler().
>>>>
>>>> If move reset LUN out of scsi_error_handler(), and perform single LUN reset, we only need
>>>> stop I/O of this single LUN, this would not affect other LUNs. If single LUN reset failed,
>>>> we can then call in the large scale error handle.
>>>>
>>> I know the EH flow.
>>>
>>> Problem here is the way parallel SCSI operates. Remember, parallel SCSI is a _bus_, and there can be only one command at a time on the bus.
>>> So if one command on the bus misfires and you have to start EH you have to stop all I/O on the bus to ensure that your EH command is the only one active on the bus.
>>>
>>
>> Thank you for you explanation, it's clear to me now.
>>
>>> For modern HBAs we sure can device other ways and means of error recovery, but I can't really see how we would do that on legacy HBAs.
>>>
>>
>> How about define a new return value of scsi_host_template's eh_timed_out callback which indicate this timeout
>> is totally handled by LLDs. Like following:
>>
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -359,6 +359,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
>>                         set_host_byte(scmd, DID_TIME_OUT);
>>                         scsi_eh_scmd_add(scmd);
>>                 }
>> +       } else if (rtn == EH_HANDLED_BY_DRIVERS) {
>> +               return BLK_EH_DONE;
>>         }
>>
>> Or scsi_host_template's eh_timed_out should not do this, we can define another callback?
>> In the LLDs's timeout handler callback, apply single LUN reset first flow as previous mail metioned.
>>
> 
> You probably want to add a scsi_host_template field or new callouts
> because some drivers only preallocate enough resources for one TMF
> at a time so we have to do it on a driver by driver basis.
> 
> For driver's setting/implementing it then to escalate you can do something
> like:
> 
> 1. Block the queue for just the LU with scsi_internal_device_block.
> 2. You can then call the new callout or old one if we just added some
> new field.
> 3. If device/lun reset fails, then block the target. Then you can do
> the host.
> 

Thanks a lot for your guide. I want to wait hannes's change.

> We could share code with scsi_ioctl_reset as well. Drivers that support
> TMFs via that ioctl already expect queuecommand to be possibly in the
> middle of a run and IO not yet timed out. For example, the code to
> block a queue and reset the device could be used for the new EH and
> SG_SCSI_RESET_DEVICE handling.
> 
> 
> 
>> Anyway, what we need is a way to reduce the time of setting host to SHOST_RECOVERY.
>>
>>>> Here is a brief flow:
>>>>
>>>> abort command
>>>>     ||
>>>>     || failed
>>>>     ||
>>>>     \/
>>>> stop single LUN's I/O (need to wait LUN's failed command number equal to busy command  number)
>>>>     ||
>>>>     || failed  (according to our statistic, 90% reset LUN would succeed)
>>>>     ||
>>>>     \/
>>>
>>> Interesting. This does not match up with my experience, where 99% of the errors were due to a command timeout.
>>>
>>> So which errors do you see here? What are the causes?
>>
>> These error statistic are from our consumers' environment,they told me about 90% timeout triggered errors can be
>> handled by reset LUN.
>>
> 
> Is this with specific drivers, transport or targets?
> .
> 

Yes, it's based on specific hardware. Details about the statistic seems secret which can not send out.