Re: SCSI layer RPM deadlock debug suggestion

On Mon, Jul 05, 2021 at 01:00:39PM +0100, John Garry wrote:
> On 05/07/2021 00:45, Bart Van Assche wrote:
> 
> Hi Alan and Bart,
> 
> Thanks for the suggestions.
> 
> > > > Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove")
> > > > solves this issue for me, but that is there for a reason.
> > > > 
> > > > Any suggestion on how to fix this deadlock?
> > > This is indeed a tricky question.  It seems like we should allow a
> > > runtime resume to succeed if the only reason it failed was that the
> > > device has been removed.
> > > 
> > > More generally, perhaps we should always consider that a runtime
> > > resume succeeds.  Any remaining problems will be dealt with by the
> > > device's driver and subsystem once the device is marked as
> > > runtime-active again.
> > > 
> > > Suppose you try changing blk_post_runtime_resume() so that it always
> > > calls blk_set_runtime_active() regardless of the value of err.  Does
> > > that fix the problem?
> > > 
> > > And more importantly, will it cause any other problems...?
> > That would cause trouble for the UFS driver and other drivers for which
> > runtime resume can fail due to e.g. the link between host and device
> > being in a bad state.

I don't understand how that could work.  If a device fails to resume 
from runtime suspend, no matter whether the reason is temporary or 
permanent, how can the system use it again?

And if the system can't use it again, what harm is there in pretending 
that the runtime resume succeeded?
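
For illustration only, the difference between the two behaviours being discussed can be modelled in userspace C. This is a sketch, not kernel code: every toy_* name is hypothetical, and it assumes that today a non-zero err leaves the queue marked suspended, whereas the proposal marks it active regardless of err.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for a queue's runtime PM status; not the kernel's definitions. */
enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_RESUMING, TOY_RPM_SUSPENDED };

struct toy_queue {
	enum toy_rpm_status rpm_status;
};

/* Assumed current behaviour: a failed resume leaves the queue suspended. */
void toy_post_runtime_resume_current(struct toy_queue *q, int err)
{
	assert(q != NULL);
	q->rpm_status = err ? TOY_RPM_SUSPENDED : TOY_RPM_ACTIVE;
}

/* Proposed behaviour: mark the queue active no matter what err says. */
void toy_post_runtime_resume_proposed(struct toy_queue *q, int err)
{
	assert(q != NULL);
	(void)err;	/* deliberately ignored in this variant */
	q->rpm_status = TOY_RPM_ACTIVE;
}

/* In this model a request can enter the queue only when it is active. */
int toy_can_enter(const struct toy_queue *q)
{
	assert(q != NULL);
	return q->rpm_status == TOY_RPM_ACTIVE;
}
```

In this model a request waiting on a queue whose resume failed can never get in under the first variant, while under the second it gets in and any failure is left to the driver and subsystem to report.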

> > How about checking the SCSI device state inside scsi_rescan_device() and
> > skipping the rescan if the SCSI device state is SDEV_CANCEL or SDEV_DEL?
> > 
> 
> I find that the device state is SDEV_RUNNING for me at that point (so it
> cannot work).
> 
> > Adding such a check inside __scsi_execute() would break sd_remove() and
> > sd_shutdown() since both use __scsi_execute() to submit a SYNCHRONIZE
> > CACHE command to the device.
> 
> Could we somehow signal from __scsi_remove_device() earlier that the request
> queue is dying or at least in some error state, so that blk_queue_enter() in
> the rescan can fail?
> 
> Currently we don't call blk_cleanup_queue() -> blk_set_queue_dying() until
> after the device_del(sdev_gendev) call in __scsi_remove_device().

I don't think that can be done.  device_del() calls the driver's 
remove routine, which may want to communicate with the device.  If the 
request queue is already in an error state, it won't be able to do so.
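
The ordering constraint can be shown with a small userspace model. It is a sketch under stated assumptions, not kernel code: the toy_* names are hypothetical, toy_driver_remove() stands in for a remove routine such as sd_remove() that must send one last command, and a dying queue is assumed to reject new requests with -ENODEV.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy device: one flag for the queue state, one for the remove-time command. */
struct toy_sdev {
	bool queue_dying;	/* models the queue having been marked dying */
	bool cache_synced;	/* set once the final command has gone through */
};

/* Models request submission: a dying queue refuses new requests. */
int toy_submit(struct toy_sdev *sdev)
{
	return sdev->queue_dying ? -ENODEV : 0;
}

/* Models the driver's remove routine, which wants to flush the cache. */
int toy_driver_remove(struct toy_sdev *sdev)
{
	int ret = toy_submit(sdev);

	if (ret == 0)
		sdev->cache_synced = true;
	return ret;
}

/*
 * Models device removal. With mark_dying_first == false the order matches
 * the one described above: the driver's remove routine runs first and the
 * queue is marked dying afterwards.
 */
int toy_remove_device(struct toy_sdev *sdev, bool mark_dying_first)
{
	int ret;

	if (mark_dying_first)
		sdev->queue_dying = true;
	ret = toy_driver_remove(sdev);
	sdev->queue_dying = true;
	return ret;
}
```

With mark_dying_first set, the remove routine's command is rejected and the cache is never flushed, which is the failure described above.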

Alan Stern


