Re: [RFC PATCH v2 00/18] scsi: scsi_error: Introduce new error handle mechanism

Mike Christie <michael.christie@xxxxxxxxxx> · Tue, 26 Sep 2023 12:37:33 -0500

On 9/26/23 7:57 AM, Wenchao Hao wrote:
> On 2023/9/26 1:54, Mike Christie wrote:
>> On 9/25/23 10:07 AM, Wenchao Hao wrote:
>>> On 2023/9/25 22:55, Christoph Hellwig wrote:
>>>> Before we add another new error handling mechanism we need to fix the
>>>> old one first.  Hannes' work on not passing the scsi_cmnd to the various
>>>> reset handlers hasn't made a lot of progress in the last five years and
>>>> we'll need to urgently fix that first before adding even more
>>>> complexity.
>>>>
>>> I observed Hannes's patches posted about one year ago, it has not been
>>> applied yet. I don't know if he is still working on it.
>>>
>>> My patches do not depend much on that work, I think the conflict can be
>>> solved fast between two changes.
>>
>> I think we want to figure out Hannes's patches first.
>>
>> For a new EH design we will want to be able to do multiple TMFs in parallel
>> on the same host/target right?
>>
> 
> It's not necessary to do multiple TMFs in parallel, it's ok to make sure
> each TMFs do not affect each other.
> 
> For example, we have two devices: 0:0:0:0 and 0:0:0:1
> 
> Both of them request device reset, they do not happened in parallel, but
> would in serial. If 0:0:0:0 is performing device reset in progress, 0:0:0:1
> just wait 0:0:0:0 to finish.

I see. I guess we still get the benefit of not having to stop other devices
when doing TMFs.

I think we still want a common way to allocate/free and manage resources
drivers will use during this time. Maybe have a init_device/target/cmd/eh_priv
and exit_device/target/eh_priv (I'm not sure of the name, but something similar
to the init_cmd_priv/exit_cmd_priv we have for normal commands.

scsi-ml then calls into the new eh with the priv data. Drivers don't have to
do the preallocation and worry if it's per device/target/host.

I'm not 100% sure about the low level details. Check out how Hannes's is
handling tag management for TMFs as well.

> 
>> The problem is that we need to be able to make forward progress in the EH
>> path and not fail just because we can't allocate memory for a TMF related
>> struct. To accomplish this now, drivers will use mempools, preallocate TMF
>> related structs/mem/tags with their scsi_cmnd related structs, preallocate
>> per host/target/device related structs or ignore what I wrote above and just
>> fail.
>>
>> Hannes's patches fix up the eh callouts so they don't pass in a scsi_cmnd
>> when it's not needed. That seems nice because after that, then for your new
>> EH we can begin to standardize on how to handle preallocation of drivers
>> resources needed to perform TMFs for your new EH. It could be a per
>> device/target/host callout to allow drivers to preallocate, then scsi-ml calls
>> into the drivers with that data. It doesn't have to be exactly like that or
>> anything close. It would be nice for drivers to not have to think about this
>> type of thing and scsi-ml just to handle the resource management for us when
>> there are multiple TMFs in progress.
>>
>