Re: [PATCH] sd: Fix a disk probing hang

Hannes Reinecke <hare@xxxxxxxx> · Wed, 8 Nov 2017 09:12:53 +0100

On 11/07/2017 11:57 PM, James Bottomley wrote:
> On Tue, 2017-11-07 at 22:42 +0000, Bart Van Assche wrote:
>> On Tue, 2017-11-07 at 10:09 -0800, James Bottomley wrote:
>>>
>>> but can you investigate the root cause rather than trying this
>>> bandaid?
>>
>> Hello James,
>>
>> Thanks for your reply. I think that the root cause is that SCSI
>> scanning activity can continue to submit I/O even after
>> scsi_remove_host() has unlocked scan_mutex but that
>> scsi_remove_host() removes some of the infrastructure that is
>> essential to process SCSI requests.
> 
> That's not really a useful answer: how does it submit I/O after the
> device goes into DEL?  In theory every I/O submitted after this is
> returned with an immediate error.  I could buy the fact that we have
> pending I/O submitted before we go into DEL, which would argue for some
> sort of quiesce wait, but I don't see how I/O submitted after DEL
> causes a hang.
> 
>>  Are you OK with
>> e.g. moving a significant part of scsi_remove_host() into
>> scsi_host_dev_release()?
> 
> Well not really without seeing the root cause.  Before
> scsi_forget_host() it's all about state and after it's just removing
> some user visible host attributes, so I can't see how either matters
> much.  scsi_forget_host() must be executed from scsi_remove_host()
> because that's how the devices go into the DEL state and how we error
> the requests without troubling the device driver, so that can't be
> moved to release.
> 
You know, this actually looks like the same issue I'm chasing with iser;
we have a customer who regularly sees lockups during scanning.
As it turns out, iser is calling scsi_device_del() from the RX thread,
which in turn needs to call async_synchronize().
If a disk scan is running at the same time we have a nice deadlock, as
the RX thread can't move forward before async_synchronize() returns,
which it'll never do as the scan cannot complete.
I've tried to fix that by having the async probing wait only for that
particular instance (look for patch 'sd: use async_probe cookie to avoid
deadlocks'), but this wasn't greeted with much enthusiasm.
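
To make the difference concrete, here is a toy model in plain C. It is
not kernel code: struct job, needs_rx_thread, wait_all_deadlocks() and
wait_one_deadlocks() are made-up names for illustration, and the model
assumes the scan cannot complete because its I/O is completed by the
same RX thread that is doing the waiting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one pending async job; not a kernel structure. */
struct job {
	unsigned long cookie;   /* identifies this particular async instance */
	bool done;              /* has the job already finished? */
	bool needs_rx_thread;   /* assumption: job only finishes if the RX thread runs */
};

/*
 * The RX thread waits for *every* pending job.  It blocks forever if any
 * unfinished job itself depends on the RX thread making progress.
 */
static bool wait_all_deadlocks(const struct job *jobs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!jobs[i].done && jobs[i].needs_rx_thread)
			return true;
	return false;
}

/*
 * The RX thread waits only for the job with the given cookie.  Unrelated
 * jobs (such as a concurrent scan) no longer matter.
 */
static bool wait_one_deadlocks(const struct job *jobs, size_t n,
			       unsigned long cookie)
{
	for (size_t i = 0; i < n; i++)
		if (jobs[i].cookie == cookie && !jobs[i].done &&
		    jobs[i].needs_rx_thread)
			return true;
	return false;
}
```

With one pending probe for the device being removed and one concurrent
scan that depends on the RX thread, the wait-for-everything variant
reports a deadlock while the per-cookie variant does not.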

So maybe it's time to investigate this properly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@xxxxxxxx			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)