On 2024/6/13 15:10, Wenchao Hao wrote: > On 2024/6/13 14:27, Hannes Reinecke wrote: >> On 6/12/24 17:06, Wenchao Hao wrote: >>> On 6/12/24 4:33 PM, Hannes Reinecke wrote: >>>> On 6/5/24 11:17, Wenchao Hao wrote: >>>>> shost_for_each_device() would skip devices which is in SDEV_CANCEL or >>>>> SDEV_DEL state, for some scenarios, we donot want to skip these devices, >>>>> so add a new macro shost_for_each_device_include_deleted() to handle it. >>>>> >>>>> Following changes are introduced: >>>>> >>>>> 1. Rework scsi_device_get(), add new helper __scsi_device_get() which >>>>> determine if skip deleted scsi_device by parameter "skip_deleted". >>>>> 2. Add new parameter "skip_deleted" to __scsi_iterate_devices() which >>>>> is used when calling __scsi_device_get() >>>>> 3. Update shost_for_each_device() to call __scsi_iterate_devices() with >>>>> "skip_deleted" true >>>>> 4. Add new macro shost_for_each_device_include_deleted() which call >>>>> __scsi_iterate_devices() with "skip_deleted" false >>>>> >>>>> Signed-off-by: Wenchao Hao <haowenchao22@xxxxxxxxx> >>>>> --- >>>>> drivers/scsi/scsi.c | 46 ++++++++++++++++++++++++++------------ >>>>> include/scsi/scsi_device.h | 25 ++++++++++++++++++--- >>>>> 2 files changed, 54 insertions(+), 17 deletions(-) >>>>> >>>>> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c >>>>> index 3e0c0381277a..5913de543d93 100644 >>>>> --- a/drivers/scsi/scsi.c >>>>> +++ b/drivers/scsi/scsi.c >>>>> @@ -735,20 +735,18 @@ int scsi_cdl_enable(struct scsi_device *sdev, bool enable) >>>>> return 0; >>>>> } >>>>> -/** >>>>> - * scsi_device_get - get an additional reference to a scsi_device >>>>> +/* >>>>> + * __scsi_device_get - get an additional reference to a scsi_device >>>>> * @sdev: device to get a reference to >>>>> - * >>>>> - * Description: Gets a reference to the scsi_device and increments the use count >>>>> - * of the underlying LLDD module. You must hold host_lock of the >>>>> - * parent Scsi_Host or already have a reference when calling this. >>>>> - * >>>>> - * This will fail if a device is deleted or cancelled, or when the LLD module >>>>> - * is in the process of being unloaded. >>>>> + * @skip_deleted: when true, would return failed if device is deleted >>>>> */ >>>>> -int scsi_device_get(struct scsi_device *sdev) >>>>> +static int __scsi_device_get(struct scsi_device *sdev, bool skip_deleted) >>>>> { >>>>> - if (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL) >>>>> + /* >>>>> + * if skip_deleted is true and device is in removing, return failed >>>>> + */ >>>>> + if (skip_deleted && >>>>> + (sdev->sdev_state == SDEV_DEL || sdev->sdev_state == SDEV_CANCEL)) >>>>> goto fail; >>>> >>>> Nack. >>>> SDEV_DEL means the device is about to be deleted, so we _must not_ access it at all. >>>> >>> >>> Sorry I added SDEV_DEL here at hand without understanding what it means. >>> Actually, just include scsi_device which is in SDEV_CANCEL would fix the >>> issues I described. >>> >>> The issues are because device removing concurrent with error handle. >>> Normally, error handle would not be triggered when scsi_device is in >>> SDEV_DEL. Below is my analysis, if it is wrong, please correct me. >>> >>> If there are scsi_cmnd remain unfinished when removing scsi_device, >>> the removing process would waiting for all commands to be finished. >>> If commands error happened and trigger error handle, the removing >>> process would be blocked until error handle finished, because >>> __scsi_remove_device called del_gendisk() which would wait all >>> requests to be finished. So now scsi_device is in SDEV_CANCEL. >>> >>> If the scsi_device is already in SDEV_DEL, then no scsi_cmnd has been >>> dispatched to this scsi_device, then error handle would never triggered. >>> >>> I want to change the new function __scsi_device_get() as following, >>> please help to review. >>> >>> /* >>> * __scsi_device_get - get an additional reference to a scsi_device >>> * @sdev: device to get a reference to >>> * @skip_canceled: when true, would return failed if device is deleted >>> */ >>> static int __scsi_device_get(struct scsi_device *sdev, bool skip_canceled) >>> { >>> /* >>> * if skip_canceled is true and device is in removing, return failed >>> */ >>> if (sdev->sdev_state == SDEV_DEL || >>> (sdev->sdev_state == SDEV_CANCEL && skip_canceled)) >>> goto fail; >>> if (!try_module_get(sdev->host->hostt->module)) >>> goto fail; >>> if (!get_device(&sdev->sdev_gendev)) >>> goto fail_put_module; >>> return 0; >>> >>> fail_put_module: >>> module_put(sdev->host->hostt->module); >>> fail: >>> return -ENXIO; >>> } >>> >> I don't think that's required. >> With your above analysis, wouldn't the problem be solved with: >> >> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c >> index 775df00021e4..911fcfa80d69 100644 >> --- a/drivers/scsi/scsi_sysfs.c >> +++ b/drivers/scsi/scsi_sysfs.c >> @@ -1470,6 +1470,8 @@ void __scsi_remove_device(struct scsi_device *sdev) >> if (sdev->sdev_state == SDEV_DEL) >> return; >> >> + scsi_block_when_processing_errors(sdev); >> + >> if (sdev->is_visible) { >> /* >> * If scsi_internal_target_block() is running concurrently, >> >> Hmm? >> > > We can not make sure no scsi_cmnd remain unfinished after scsi_block_when_processing_errors(). For example, there is a > command has been dispatched but it's not timeouted when removing > device, no error happened. After scsi_device is set to SDEV_CANCEL, > the removing process would be blocked by del_gendisk() because there > is still a request. > > Then the request timeout and abort failed, error handle would be triggered, now scsi_device is SDEV_CANCEL. > > The error handle would skip this scsi_device when doing device reset. > >> Cheers, >> >> Hannes > Hi Hannes, Would you review these patches? Or do how do you suggest to fix the issues? thanks.