Re: [2.6.27.25] Hang in SCSI sync cache when a disk is removed--?

Desai, Kashyap <Kashyap.Desai@xxxxxxx> wrote:
> Regarding James's comment, I want to add some info.
> When we enter sd_remove(), which tries to flush the
> cache with SYNCHRONIZE CACHE, we are seeing the system hang. My guess is
> that the MPT driver is not even receiving the SYNCHRONIZE CACHE command.
> (Referring to the back trace provided in the first mail,
> scsi_dispatch_cmd() might not be called. The back trace suggests a hang
> in scsi_get_command(), just before scsi_dispatch_cmd().)
>  
> 
The SYNCHRONIZE CACHE is blocked because the host is in error recovery.
Since the SYNCHRONIZE CACHE is being driven off the mpt work queue, it
blocks the scsi error handler thread from completing: mptscsih_host_reset
calls flush_workqueue, which waits for that work item to finish, and that
is the deadlock.
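
To make the cycle explicit, here is a minimal user-space sketch of the
wait-for relationships described above. It is not kernel code: the enum
labels and the in_wait_cycle() helper are illustrative names I made up, and
the graph only encodes the three dependencies stated in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Labels for the three parties in the hang; illustrative names only. */
enum party {
    EH_THREAD,      /* scsi error handler thread running the host reset */
    MPT_WORKQUEUE,  /* mpt work item doing removal + SYNCHRONIZE CACHE */
    HOST_RECOVERY,  /* host "in recovery" state that blocks new commands */
    N_PARTIES
};

/* waits_for[p] is the party that p cannot make progress without. */
static const enum party waits_for[N_PARTIES] = {
    [EH_THREAD]     = MPT_WORKQUEUE,  /* host reset calls flush_workqueue */
    [MPT_WORKQUEUE] = HOST_RECOVERY,  /* command blocked while in recovery */
    [HOST_RECOVERY] = EH_THREAD,      /* recovery ends when EH thread is done */
};

/* Follow the single outgoing edge from 'start'; 1 if we return to it. */
static int in_wait_cycle(enum party start)
{
    enum party p = start;
    size_t hops;

    assert(start < N_PARTIES);
    for (hops = 0; hops < N_PARTIES; hops++) {
        p = waits_for[p];
        if (p == start)
            return 1;
    }
    return 0;
}
```

Every party sits on the same cycle, so none of them can be the one that
makes progress first. Breaking any single edge (for example, not flushing
the work queue from the reset path) would remove the cycle in this model.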

In my previous email I listed the lead up events. 

We start with blk_abort_queue scheduling error recovery (an issue
previously reported). In theory the hang could occur in other error
handler / device delete scenarios, but with much lower probability. This
Linux version also does not contain support for DID_TRANSPORT_DISRUPTED,
so that workaround cannot be used.

The second issue is that we continue through the progressive error
handling steps when we do not need to, because we believe the device needs
further error recovery. This leads to the host reset routine being called.
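
A simplified model of that escalation, assuming a reduced ladder of
abort, device reset, bus reset, host reset, and a TEST UNIT READY check
after each step. The step names, the tur_fn callback and escalate() are
hypothetical; the real mid-layer logic has more steps and more conditions
than this sketch.

```c
#include <assert.h>

#define DID_OK        0x00  /* host byte values as in the kernel's scsi.h */
#define DID_BUS_BUSY  0x02

/* Reduced escalation ladder; names are illustrative only. */
enum eh_step { EH_ABORT, EH_DEVICE_RESET, EH_BUS_RESET, EH_HOST_RESET,
               EH_GIVE_UP };

/* Hypothetical callback: host byte of the TUR issued after 'step'. */
typedef int (*tur_fn)(enum eh_step step);

/* Return the step after which the device answered, or EH_GIVE_UP. */
static enum eh_step escalate(tur_fn tur)
{
    enum eh_step step;

    assert(tur != 0);
    for (step = EH_ABORT; step < EH_GIVE_UP; step++) {
        if (tur(step) == DID_OK)
            return step;    /* device looks healthy, stop escalating */
    }
    return EH_GIVE_UP;
}

/* A removed device: every TUR comes back DID_BUS_BUSY. */
static int tur_always_busy(enum eh_step step)
{
    (void)step;
    return DID_BUS_BUSY;
}

/* A healthy device: the first TUR succeeds. */
static int tur_ok_after_abort(enum eh_step step)
{
    (void)step;
    return DID_OK;
}

/* 1 if the host reset step would have been run for this device. */
static int reaches_host_reset(tur_fn tur)
{
    return escalate(tur) >= EH_HOST_RESET;
}
```

In this model a device whose TUR never succeeds always walks the ladder
up to the host reset, which is the step that ends up in flush_workqueue.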

> Even if the synchronize cache command reaches mptsas, mptsas will return DID_NO_CONNECT since hostdata is no longer valid.
> Here is a snippet of the mptsas code.
> 
> ------------------------------------------------------
> mptsas_qcmd(struct scsi_cmnd *SCpnt, void (*done)(struct scsi_cmnd *))
> {
>         VirtDevice      *vdevice = SCpnt->device->hostdata;
>  
>         if (!vdevice || !vdevice->vtarget || vdevice->vtarget->deleted) {
>                 SCpnt->result = DID_NO_CONNECT << 16;
>                 done(SCpnt);
>                 return 0;
>         }
> ------------------------------------------------------
> Thanks,
> Kashyap
> 
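
For reference, the check in the snippet above can be exercised in user
space with a small stand-in. The model_* types and functions below are
invented for illustration and only mirror the three conditions shown in
the quoted code; they are not the driver's real structures.

```c
#include <assert.h>
#include <stddef.h>

#define DID_NO_CONNECT 0x01  /* host byte value as in the kernel's scsi.h */

/* Stand-ins for the driver's per-target and per-device data. */
struct model_target { int deleted; };
struct model_device { struct model_target *vtarget; };

/* Stand-in for a scsi command: only the fields this sketch needs. */
struct model_cmd {
    struct model_device *hostdata;
    int result;
    int completed;
};

/* Stand-in for the done() completion callback. */
static void model_done(struct model_cmd *cmd)
{
    cmd->completed = 1;
}

/*
 * Mirror of the quoted check: fail the command at once when the device
 * data is missing or the target is marked deleted.
 * Returns 1 if the command was failed here, 0 if it would go on to the
 * hardware path (not modelled).
 */
static int model_qcmd(struct model_cmd *cmd)
{
    struct model_device *vdevice;

    assert(cmd != NULL);
    vdevice = cmd->hostdata;
    if (!vdevice || !vdevice->vtarget || vdevice->vtarget->deleted) {
        cmd->result = DID_NO_CONNECT << 16;
        model_done(cmd);
        return 1;
    }
    return 0;
}

/* Extract the host byte from a result word. */
static int model_host_byte(int result)
{
    return (result >> 16) & 0xff;
}
```

This only shows what happens once the queuecommand path is reached; it
says nothing about a command that is held back earlier because the host
is in recovery.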
> -----Original Message-----
> From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Smith
> Sent: Tuesday, July 07, 2009 8:04 PM
> To: James Bottomley
> Cc: Mike Anderson; linux-scsi@xxxxxxxxxxxxxxx; Mike Christie; Moore, Eric
> Subject: Re: [2.6.27.25] Hang in SCSI sync cache when a disk is removed--?
> 
> Hi James; thanks for that examination; it's very helpful.
> 
> Unfortunately Eric is on vacation until the middle of the month and we
> really need to resolve this issue this week if possible.  I'm forwarding
> your message to the LSI developers we've been working with.
> 
> MikeA: we're working on getting the sysrq "t" output in the meantime,
> just in case it's revealing.
> 
> On Tue, 2009-07-07 at 08:58 -0500, James Bottomley wrote:
> > On Mon, 2009-07-06 at 23:25 -0700, Mike Anderson wrote:
> > > Paul Smith <paul@xxxxxxxxxxxxxxxxx> wrote:
> > > > 
> > > 
> > > I was expecting a little more output from the error handler thread, but
> > > the log does show a few things.
> > > 
> > > It would be good if in the failing case you could provide a sysrq "t"
> > > output so I could understand where the reset handler is waiting.
> > > 
> > > It appears there are a few things going on.
> > > 1.) The dm deactivate calling blk_abort_queue is leading to error handler
> > > activation. Similar to a previously described issue.
> > > http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/8543
> > > 	- This kernel does not have DID_TRANSPORT_DISRUPTED so that
> > > 	  avoidance method cannot be used.
> > > 2.) The task aborts are completing, but the TUR (TEST UNIT READY) is
> > > most likely being failed with a response of DID_BUS_BUSY, leading to
> > > continued recovery.
> > > 3.) We appear to be inside mpt_HardResetHandler, but need more info to
> > > understand where in the call chain.
> > 
> > Actually, isn't the problem much simpler?
> > 
> > The mptsas driver calls sas_port_delete() when the event occurs.  This
> > deletes the rphy and invokes scsi_remove_target().  It looks like the
> > device had a write back cache, so part of scsi_remove_target() goes to
> > scsi_remove_device() which triggers sd_remove() which tries to flush the
> > cache with SYNCHRONIZE CACHE.
> > 
> > This is the point at which the hang occurs.  It seems that the mptsas
> > goes out to lunch when it sees a command to a device on a deleted port.
> > The remainder of the log is error handling trying to get the attention
> > of the mptsas firmware back again.
> > 
> > This is a pretty huge problem because any set of commands can be racing
> > with surprise ejection ... there's no way we can gate it in the mid
> > layer.  The behaviour we expect is that after surprise ejection, a
> > driver/device will automatically error (with something like
> > DID_NO_CONNECT) all commands for the ejected device.
> > 
> > James
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-andmike
--
Michael Anderson
andmike@xxxxxxxxxxxxxxxxxx
