RE: [PATCH 1/1] megaraid_{mm,mbox}: fix a bug in reset handler

Andre Hedrick <andre@xxxxxxxxxxxxx> · Tue, 18 Apr 2006 09:33:28 -0700 (PDT)

Seokmann,

In the case where the card or the host issues a pci master abort that
wedges the adapter and marks adapter->hw_error = 1 and then never recovers
again, where is the code to bring the card back online?  I know first hand
this is a problem because I am working with the firmware folk to address
this issue.

Also please explain how changing the number of times one loops from 5 F's
to 6 F's allows the FW to continue?  I am betting it I take the new driver
and the firmware posted on LSI's websight it will still crash in the same
conditions.  Anybody up for lunch paid by the non-winning party?  I have
not tested it yet but everything I see does not address the fundamental
issues.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Tue, 18 Apr 2006, Ju, Seokmann wrote:

> Hi,
> 
> I've seen the patch (megaraid_mmmbox_fix_a_bug_in_reset_handler.patch) available on 2.6.17-rc1-mm3 under "SCSI warning fix" section.
> What should I do to remove "warning" tag on the patch.
> I've attached another patch in previous email that has 'udelay()' in the loop to remove NMI concern, and waiting for confirmation on it. Will this change remove the "warning"?
> 
> I'll submit the patch officially by end of today.
> 
> Any comment would be appreciated.
> 
> Thank you,
> 
> 
> > -----Original Message-----
> > From: Ju, Seokmann 
> > Sent: Monday, April 17, 2006 9:13 AM
> > To: 'Andre Hedrick'; Andrew Morton
> > Cc: James.Bottomley@xxxxxxxxxxxx; 
> > linux-kernel@xxxxxxxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx
> > Subject: RE: [PATCH 1/1] megaraid_{mm,mbox}: fix a bug in 
> > reset handler
> > 
> > Hi,
> > 
> > Thank you all for comment on the issue.
> > From the comment, it looks having 'ndelay/udelay' would be 
> > right way to address the issue.
> > I've attached a patch for just review purpose.
> > Once it confirmed, I'll post the patch officially.
> > 
> > Thank you,
> > 
> > > -----Original Message-----
> > > From: Andre Hedrick [mailto:andre@xxxxxxxxxxxxx] 
> > > Sent: Saturday, April 15, 2006 3:10 AM
> > > To: Andrew Morton
> > > Cc: Ju, Seokmann; Ju, Seokmann; James.Bottomley@xxxxxxxxxxxx; 
> > > linux-kernel@xxxxxxxxxxxxxxx; linux-scsi@xxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH 1/1] megaraid_{mm,mbox}: fix a bug in 
> > > reset handler
> > > 
> > > 
> > > Andrew,
> > > 
> > > This is real, and is a known bug which is 100% reproducable (sp).
> > > There are other harry issues too, but this is as much as I can say.
> > > 
> > > cpu_relax() will not work, already tried some time ago.
> > > 
> > > 
> > > Andre Hedrick
> > > LAD Storage Consulting Group
> > > 
> > > On Wed, 12 Apr 2006, Andrew Morton wrote:
> > > 
> > > > "Ju, Seokmann" <Seokmann.Ju@xxxxxxxx> wrote:
> > > > >
> > > > > This patch has fix for a bug in the 'megaraid_reset_handler()'.
> > > > > 
> > > > >  When abort failed, the driver gets reset handleer 
> > > called. In the reset
> > > > >  handler, driver calls 'scsi_done()' callback for same 
> > > SCSI command
> > > > >  packet (struct scsi_cmnd) multiple times if there are 
> > > multiple SCSI
> > > > >  command packet in the pend_list. More over, if there are 
> > > entry in the
> > > > >  pend_lsit with IOCTL packet associated, the driver 
> > > returns it to wrong
> > > > >  free_list so that, in turn, the driver could end up with 
> > > 'NULL pointer
> > > > >  dereference..' during I/O command building with 
> > > incorrect resource.
> > > > > 
> > > > >  Also, the patch contains several minor/cosmetic changes 
> > > besides this.
> > > > >
> > > > > ..
> > > > >
> > > > > @@ -2655,32 +2655,48 @@
> > > > >  	// Also, reset all the commands currently owned 
> > by the driver
> > > > >  	spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
> > > > >  	list_for_each_entry_safe(scb, tmp, 
> > &adapter->pend_list, list) {
> > > > > -
> > > > >  		list_del_init(&scb->list);	// from 
> > pending list
> > > > >  
> > > > > -		con_log(CL_ANN, (KERN_WARNING
> > > > > -			"megaraid: %ld:%d[%d:%d], reset from 
> > > pending list\n",
> > > > > -				scp->serial_number, scb->sno,
> > > > > -				scb->dev_channel, 
> > scb->dev_target));
> > > > > +		if (scb->sno >= MBOX_MAX_SCSI_CMDS) {
> > > > > +			con_log(CL_ANN, (KERN_WARNING 
> > > > > +			"megaraid: IOCTL packet with %d[%d:%d] 
> > > being reset\n",
> > > > > +			scb->sno, scb->dev_channel, 
> > scb->dev_target));
> > > > >  
> > > > > -		scp->result = (DID_RESET << 16);
> > > > > -		scp->scsi_done(scp);
> > > > > +			scb->status = -EFAULT;
> > > > 
> > > > What is the significance of -EFAULT here?  Seems inappropriate?
> > > > 
> > > > > @@ -2918,12 +2933,12 @@
> > > > >  	wmb();
> > > > >  	WRINDOOR(raid_dev, raid_dev->mbox_dma | 0x1);
> > > > >  
> > > > > -	for (i = 0; i < 0xFFFFF; i++) {
> > > > > +	for (i = 0; i < 0xFFFFFF; i++) {
> > > > >  		if (mbox->numstatus != 0xFF) break;
> > > > >  		rmb();
> > > > >  	}
> > > > 
> > > > Oh my.  That's an awfully long interrupts-off spin.  1.7e7 
> > > operations with
> > > > an NMI watchdog timeout of five seconds - I'm surprised it 
> > > doesn't trigger.
> > > > 
> > > > Is that reading from a PCI register there?   Or main memory?
> > > > 
> > > > I'm somewhat surprised that the compiler never "optimises" 
> > > this into a
> > > > lockup, actually.  That's what `volatile' is for.
> > > > 
> > > > Is it not possible to do this with an interrupt?
> > > > 
> > > > A `cpu_relax()' in that loop would help cool things down a bit.
> > > > 
> > > > 
> > > > -
> > > > : send the line "unsubscribe 
> > > linux-scsi" in
> > > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > 
> > > 
> 

-
: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html