Mike Anderson <andmike@xxxxxxxxxxxxxxxxxx> wrote:
> Desai, Kashyap <Kashyap.Desai@xxxxxxx> wrote:
> > Regarding James's comment I want to add some info.
> > When we enter sd_remove() which tries to flush the
> > cache with SYNCHRONIZE CACHE we are seeing the system hang. My guess is that the MPT driver is not even receiving the SYNCHRONIZE CACHE command. (Referring to the back trace provided in the first mail, scsi_dispatch_cmd() might not be called. The back trace suggests a hang in scsi_get_command() just before scsi_dispatch_cmd().)
> >
> >
> The SYNCHRONIZE CACHE is blocked by the host being in error recovery.
> Since the SYNCHRONIZE CACHE is being driven off the mpt work queue it will
> block the scsi error handler thread from completing as the
> mptscsih_host_reset leads to calling flush_workqueue which leads to this
> deadlock.
>
> In my previous email I listed the lead up events.
>
> We start with blk_abort_queue scheduling error recovery (issue previously reported). In theory the hang issue could occur in other error handler / device delete scenarios, but with much less probability. This Linux version also does not contain support for DID_TRANSPORT_DISRUPTED so this workaround cannot be used.
> One solution would be to correct the problem of blk_abort_queue getting called in these transport cases. I wanted to try and utilize the request information now that we have request based dm-mp (and once we settle on a proper mapping of the codes), but that would not be an option in this kernel. A short term solution could also be looked into. Another option, it appears, would be to return DID_IMM_RETRY instead of DID_BUS_BUSY in fusion/mptscsih.c (SAS_LOGINFO_NEXUS_LOSS). It appears that this could come close to DID_TRANSPORT_DISRUPTED behavior in this kernel release. Or we can continue to look into solutions of not deadlocking in recovery.
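
To make the DID_IMM_RETRY versus DID_BUS_BUSY option concrete, here is a rough userspace sketch of how the host byte is packed into and read back from the command result. It is not the mptscsih.c code: nexus_loss_result(), its use_imm_retry switch and host_byte_name() are made-up names for illustration, and the DID_* values are assumed to match include/scsi/scsi.h in this kernel. It only shows the encoding and says nothing about how the mid layer then treats each code.

```c
/* Host byte values; assumed to match include/scsi/scsi.h in this kernel. */
enum {
	DID_NO_CONNECT = 0x01,
	DID_BUS_BUSY   = 0x02,
	DID_IMM_RETRY  = 0x0c,
};

/* Same shape as the kernel's host_byte() macro: bits 16..23 of result. */
#define host_byte(result) (((result) >> 16) & 0xff)

/*
 * Hypothetical stand-in for the nexus loss case in fusion/mptscsih.c.
 * use_imm_retry is an illustrative switch, not a real driver parameter.
 */
int nexus_loss_result(int use_imm_retry)
{
	int host = use_imm_retry ? DID_IMM_RETRY : DID_BUS_BUSY;

	/* The host byte lives in bits 16..23, as in DID_NO_CONNECT << 16. */
	return host << 16;
}

/* Name a host byte for logging; covers only the codes used above. */
const char *host_byte_name(int result)
{
	switch (host_byte(result)) {
	case DID_NO_CONNECT:
		return "DID_NO_CONNECT";
	case DID_BUS_BUSY:
		return "DID_BUS_BUSY";
	case DID_IMM_RETRY:
		return "DID_IMM_RETRY";
	default:
		return "other";
	}
}
```

With these assumed values, nexus_loss_result(0) is 0x00020000 (DID_BUS_BUSY) and nexus_loss_result(1) is 0x000c0000 (DID_IMM_RETRY).
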
> The second issue is that we continue through progressive error handling
> steps when we do not need to, as we believe the device needs further error
> recovery, leading to the host reset routine being called.
>
> > Even if the synchronize cache command reaches mptsas, mptsas will return DID_NO_CONNECT since hostdata is no longer valid.
> > Here is a snippet of the mptsas code.
> >
> > ------------------------------------------------------
> > mptsas_qcmd(struct scsi_cmnd *SCpnt, void (*done)(struct scsi_cmnd *))
> > {
> > 	VirtDevice *vdevice = SCpnt->device->hostdata;
> >
> > 	if (!vdevice || !vdevice->vtarget || vdevice->vtarget->deleted) {
> > 		SCpnt->result = DID_NO_CONNECT << 16;
> > 		done(SCpnt);
> > 		return 0;
> > 	}
> > ------------------------------------------------------
> > Thanks,
> > Kashyap
> >
> > -----Original Message-----
> > From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of Paul Smith
> > Sent: Tuesday, July 07, 2009 8:04 PM
> > To: James Bottomley
> > Cc: Mike Anderson; linux-scsi@xxxxxxxxxxxxxxx; Mike Christie; Moore, Eric
> > Subject: Re: [2.6.27.25] Hang in SCSI sync cache when a disk is removed--?
> >
> > Hi James; thanks for that examination; it's very helpful.
> >
> > Unfortunately Eric is on vacation until the middle of the month and we
> > really need to resolve this issue this week if possible. I'm forwarding
> > your message to the LSI developers we've been working with.
> >
> > MikeA: we're working on getting the sysrq "t" output in the meantime,
> > just in case it's revealing.
> >
> > On Tue, 2009-07-07 at 08:58 -0500, James Bottomley wrote:
> > > On Mon, 2009-07-06 at 23:25 -0700, Mike Anderson wrote:
> > > > Paul Smith <paul@xxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > >
> > > > I was expecting a little more output from the error handler thread, but
> > > > the log does show a few things.
> > > >
> > > > It would be good if in the failing case you could provide a sysrq "t"
> > > > output so I could understand where the reset handler is waiting.
> > > >
> > > > It appears there are a few things going on.
> > > > 1.) The dm deactivate calling blk_abort_queue is leading to error handler
> > > >     activation. Similar to a previously described issue.
> > > >     http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/8543
> > > >     - This kernel does not have DID_TRANSPORT_DISRUPTED so that
> > > >       avoidance method cannot be used.
> > > > 2.) The task aborts are completing, but the tur is most likely being
> > > >     failed with a response of DID_BUS_BUSY leading to continued recovery.
> > > > 3.) We appear to be inside mpt_HardResetHandler, but need more info to
> > > >     understand where in the call chain.
> > >
> > > Actually, isn't the problem much simpler?
> > >
> > > The mptsas driver calls sas_port_delete() when the event occurs. This
> > > deletes the rphy and invokes scsi_remove_target(). It looks like the
> > > device had a write back cache, so part of scsi_remove_target() goes to
> > > scsi_remove_device() which triggers sd_remove() which tries to flush the
> > > cache with SYNCHRONIZE CACHE.
> > >
> > > This is the point at which the hang occurs. It seems that the mptsas
> > > goes out to lunch when it sees a command to a device on a deleted port.
> > > The remainder of the log is error handling trying to get the attention
> > > of the mptsas firmware back again.
> > >
> > > This is a pretty huge problem because any set of commands can be racing
> > > with surprise ejection ... there's no way we can gate it in the mid
> > > layer. The behaviour we expect is that after surprise ejection, a
> > > driver/device will automatically error (with something like
> > > DID_NO_CONNECT) all commands for the ejected device.
> > >
> > > James
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> -andmike
> --
> Michael Anderson
> andmike@xxxxxxxxxxxxxxxxxx

-andmike
--
Michael Anderson
andmike@xxxxxxxxxxxxxxxxxx