Re: [PATCH 1/1] scsi: fix hang when device state is set via sysfs

Mike Christie <michael.christie@xxxxxxxxxx> · Tue, 12 Oct 2021 10:52:29 -0500

On 10/12/21 10:50 AM, Mike Christie wrote:
> On 10/5/21 11:45 PM, Mike Christie wrote:
>> Cc'ing Lee.
>>
>> On 10/5/21 11:31 PM, Mike Christie wrote:
>>> This fixes a regression added with:
>>>
>>> commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
>>> offlinining device")
>>>
>>> The problem is that after iSCSI recovery, iscsid will call into the kernel
>>> to set the dev's state to running, and with that patch we now call
>>> scsi_rescan_device with the state_mutex held. If the scsi error handler
>>> thread is just starting to test the device in scsi_send_eh_cmnd then it's
>>> going to try to grab the state_mutex.
>>>
>>> We are then stuck, because when scsi_rescan_device tries to send its IO,
>>> scsi_queue_rq calls scsi_host_queue_ready, where scsi_host_in_recovery
>>> returns true (the host state is still in recovery), and the IO is just
>>> requeued. scsi_send_eh_cmnd will then never be able to grab the
>>> state_mutex to finish error handling.
>>>
>>> This just moves the scsi_rescan_device call to after we drop the
>>> state_mutex.
>>
>>
>> I want to maybe nak my own patch. There is still a problem where if one
>> of the rescan IOs hits an issue then userspace is stuck waiting for
>> however long it takes to perform recovery. For iscsid, this will cause
>> problems because it sets the device state from its main thread. So
>> while scsi_rescan_device is hung, iscsid can't do anything for
>> any session.
>>
>> I think we either want to:
>>
>> 1. Do the patch below, but Lee will need to change iscsid so it sets
>> the dev state from a worker thread.
>>
>> 2. Have the kernel kick off the rescan from a workqueue. This seems
>> easiest but I'm not sure if it will cause issues for lijinlin's use
>> case.
> 
> I have not heard from Huawei, but I don't think we can do 2. The problem
> is that I think userspace will not assume once the write returns that the

Meant userspace will now assume.
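
For anyone following along, the locking change in the original patch (do the
rescan only after state_mutex is dropped) boils down to roughly the pattern
below. This is a userspace sketch with pthreads, not the kernel code: fake_dev,
fake_rescan and fake_store_state are hypothetical stand-ins for the sdev, for
scsi_rescan_device and for the sysfs state store, and the trylock check exists
only to make the "lock is not held" property visible.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the sdev state and its state_mutex. */
enum dev_state { DEV_OFFLINE, DEV_RUNNING };

struct fake_dev {
	pthread_mutex_t state_mutex;
	enum dev_state state;
	int rescans;                    /* number of fake_rescan() calls */
	bool rescanned_with_lock_held;  /* set if the pattern is violated */
};

/*
 * Stand-in for scsi_rescan_device(). The real call can block on IO for a
 * long time, so it must not run with state_mutex held. The trylock only
 * succeeds if the caller has already dropped the mutex.
 */
static void fake_rescan(struct fake_dev *dev)
{
	if (pthread_mutex_trylock(&dev->state_mutex) == 0)
		pthread_mutex_unlock(&dev->state_mutex);
	else
		dev->rescanned_with_lock_held = true;
	dev->rescans++;
}

/*
 * Stand-in for the sysfs state store: update the state under the mutex,
 * remember whether a rescan is needed, and do the rescan after unlocking.
 */
static void fake_store_state(struct fake_dev *dev, enum dev_state new_state)
{
	bool rescan;

	pthread_mutex_lock(&dev->state_mutex);
	dev->state = new_state;
	rescan = (new_state == DEV_RUNNING);
	pthread_mutex_unlock(&dev->state_mutex);

	if (rescan)
		fake_rescan(dev);
}
```

Note this only covers the lock ordering. The caller still blocks in
fake_rescan until it returns, which is the remaining problem for iscsid
described above.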