On Wed, Nov 19 2008, Miller, Mike (OS Dev) wrote:
>
>
> > -----Original Message-----
> > From: Randy Dunlap [mailto:randy.dunlap@xxxxxxxxxx]
> > Sent: Wednesday, November 19, 2008 11:23 AM
> > To: Miller, Mike (OS Dev)
> > Cc: Jens Axboe; scsi; James Bottomley; lkml; akpm
> > Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> >
> > Miller, Mike (OS Dev) wrote:
> > >
> > >> -----Original Message-----
> > >> From: Jens Axboe [mailto:jens.axboe@xxxxxxxxxx]
> > >> Sent: Wednesday, November 19, 2008 2:52 AM
> > >> To: Randy Dunlap
> > >> Cc: scsi; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
> > >> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> > >>
> > >> On Tue, Nov 18 2008, Randy Dunlap wrote:
> > >>> Randy Dunlap wrote:
> > >>>> Randy Dunlap wrote:
> > >>>>> Miller, Mike (OS Dev) wrote:
> > >>>>>>> -----Original Message-----
> > >>>>>>> From: Randy Dunlap [mailto:randy.dunlap@xxxxxxxxxx]
> > >>>>>>> Sent: Thursday, September 25, 2008 3:40 PM
> > >>>>>>> To: scsi
> > >>>>>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml;
> > >>>>>>> akpm
> > >>>>>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> > >>>>>>>
> > >>>>>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
> > >>>>>>>
> > >>>>>>>> Jens Axboe wrote:
> > >>>>>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
> > >>>>>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
> > >>>>>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
> > >>>>>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
> > >>>>>>> <do_cciss_intr+2509>
> > >>>>>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627
> > >>>>>>>>>>>>>> SA5_fifo_full
> > >>>>>>>>>>>>>>
> > >> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:
> > >>>>>>> 2
> > >>>>>>>>>>> 06
> > >>>>>>>>>>>>> OK ...that's confusing. It seems to be saying that
> > >>>>>>> ctrlr_info_t
> > >>>>>>>>>>>>> * was NULL. However, I can't see a way of
> > >> getting into the
> > >>>>>>>>>>> fifo_full
> > >>>>>>>>>>>>> callback from do_cciss_intr ..
> > >>>>>>>>>>>>> especially not with an NULL host.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> James
> > >>>>>>>>>>>> That is weird. Even if we could get there
> > >> fifo_full doesn't
> > >>>>>>>>>>> do anything but wait for a bit.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> This just happened again. This time it's on
> > >> 2.6.27-rc5-git3.
> > >>>>>>>>>>> ~Randy
> > >>>>>>>>>> Thanks Randy. I think. :)
> > >>>>>>>>>>
> > >>>>>>>>>> I'll try to recreate in my lab.
> > >>>>>>>>> This looks somewhat strange, mostly like 'c' is NULL
> > >> and it's
> > >>>>>>>>> oopsing in in removeQ (I don't think Randy's analysis is
> > >>>>>>> correct in
> > >>>>>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
> > >>>>>>> cannot be
> > >>>>>>>>> NULL, it's c->prev or c->next that are NULL.
> > >>>>> This BUG: has happened (now) 5 times today. Higher
> > >> frequency than
> > >>>>> usual for some reason.
> > >>>>>
> > >>>>> I enabled CCISS_DEBUG and added one printk in
> > removeQ(). On the
> > >>>>> first call
> > >>>> s/first/second/
> > >>>>
> > >>>>
> > >>>>> to removeQ(), both c->next and c->prev are NULL.
> > >>>>>
> > >>>>> Here's the kernel log output from cciss:
> > >>> I added a printk() in addQ() as well. Here's the new output:
> > >>>
> > >>> HP CISS Driver (v 3.6.20)
> > >>> ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54 cciss
> > >> 0000:42:08.0:
> > >>> PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
> > command =
> > >>> 147 irq = 36 board_id = 3211103c cciss 0000:42:08.0: irq 87 for
> > >>> MSI/MSI-X address 0 = fdf80000 cfg base address = 10 cfg
> > >> base address
> > >>> index = 0 cfg offset = 400 Controller Configuration information
> > >>> ------------------------------------
> > >>> Signature = CISS
> > >>> Spec Number = 1
> > >>> Transport methods supported = 0x6
> > >>> Transport methods active = 0x3
> > >>> Requested transport Method = 0x0
> > >>> Coalesce Interrupt Delay = 0x0
> > >>> Coalesce Interrupt Count = 0x1
> > >>> Max outstanding commands = 0x256
> > >>> Bus Types = 0x200000
> > >>> Server Name =
> > >>> Heartbeat Counter = 0x1672
> > >>>
> > >>>
> > >>> Trying to put board into Simple mode I counter got to 1 0
> > Controller
> > >>> Configuration information
> > >>> ------------------------------------
> > >>> Signature = CISS
> > >>> Spec Number = 1
> > >>> Transport methods supported = 0x6
> > >>> Transport methods active = 0x3
> > >>> Requested transport Method = 0x0
> > >>> Coalesce Interrupt Delay = 0x0
> > >>> Coalesce Interrupt Count = 0x1
> > >>> Max outstanding commands = 0x256
> > >>> Bus Types = 0x200000
> > >>> Server Name =
> > >>> Heartbeat Counter = 0x1672
> > >>>
> > >>>
> > >>> cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
> > >>> cciss: intr_pending 8
> > >>> cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
> > >>> cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000,
> > >>> next=ffff88007f83e000, prev=ffff88007f83e000 Sending
> > >> 7f83e000 - down
> > >>> to controller
> > >>> cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
> > >>> cciss: intr_pending 8
> > >>> cciss: Read 4 back from board
> > >>> cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000,
> > >>> next=0000000000000000, prev=0000000000000000
> > >>> BUG: unable to handle kernel NULL pointer dereference at
> > >>> 0000000000000248
> > >> Randy, can you post the debug patch you used? The above goes boom
> > >> when it attempts to remove a command that isn't on the
> > list, the Qptr
> > >> in the last example should be empty, hence the oops. So I'd be
> > >> interested in seeing what removeQ() calls this is, I'm
> > assuming it's
> > >> this bit in
> > >> do_cciss_intr():
> > >>
> > >>     ...
> > >>         while (c->busaddr != a) {
> > >>             c = c->next;
> > >>             if (c == h->cmpQ)
> > >>                 break;
> > >>         }
> > >>     }
> > >>     /*
> > >>      * If we've found the command, take it off the
> > >>      * completion Q and free it
> > >>      */
> > >>     if (c->busaddr == a) {
> > >>         removeQ(&h->cmpQ, c);
> > >>         if (c->cmd_type == CMD_RWREQ) {
> > >>             complete_command(h, c, 0);
> > >>     ...
> > >>
> > >> If so, what part of the c lookup are you hitting - the on
> > that does:
> > >>
> > >>     c = h->cmd_pool + a2;
> > >>
> > >> or the c->busaddr check that his shown above?
> > >>
> > >> --
> > > Randy,
> > > I still can't reproduce this bug. I have your config file
> > on a BL465c w/e200i. Just to confirm, you only see this at
> > init time, correct?
> >
> > Yes, only at init time.
> >
> > > Please post your debug patch as Jens requested.
> >
> > Done (separately).
> >
> > I need to back up a bit. Yesterday these BUGs happened
> > consistenly, so I wondered why. Then I recalled that for
> > debugging another bug/problem, I had changed the test
> > system's normal boot kernel from 2.6.25 to 2.6.18-8. The
> > test system is used to build and then boot the new kernel
> > *via kexec*, so it's quite possible (or certain) that
> > something in the kexec world has been fixed since 2.6.18. I
> > don't recall seeing this problem lately when using 2.6.25 to
> > kexec/boot the new test kernel, so I'm quite willing to drop
> > the bug for now and then re-open it if I see the problem again. OK??
>
> Ahhhh, the kexec piece was missing. Now I don't feel quite so
> clueless. I'm OK with dropping the bug for now. Jens, James?

Yeah, kexec is definitely a clue. My guess is that we got some sort of
leftover completion. Regardless of the status of this particular bug,
I think it would be a good idea to add some checks for when an attempt
is made to remove a command from a queue it isn't currently on.

-- 
Jens Axboe
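
For illustration, the kind of check suggested above might look roughly like the user-space sketch below. This is not the cciss driver code: struct cmd, add_q(), on_q() and remove_q_checked() are hypothetical names, and the sketch assumes a command that is on no queue has next == prev == NULL (commands start zeroed, and removal resets both pointers).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's command structure. */
struct cmd {
	struct cmd *next;
	struct cmd *prev;
	unsigned int busaddr;
};

/* Append c to the tail of the circular list headed by *qptr. */
static inline void add_q(struct cmd **qptr, struct cmd *c)
{
	/* Assumption: an unqueued command has both links NULL. */
	assert(c->next == NULL && c->prev == NULL);

	if (*qptr == NULL) {
		*qptr = c;
		c->next = c->prev = c;
	} else {
		c->prev = (*qptr)->prev;
		c->next = *qptr;
		(*qptr)->prev->next = c;
		(*qptr)->prev = c;
	}
}

/* Return 1 if c is reachable from head, 0 otherwise (O(n) walk). */
static inline int on_q(const struct cmd *head, const struct cmd *c)
{
	const struct cmd *p = head;

	if (p == NULL)
		return 0;
	do {
		if (p == c)
			return 1;
		p = p->next;
	} while (p != NULL && p != head);
	return 0;
}

/*
 * Remove c from the queue headed by *qptr.
 * Returns 0 on success, -1 if c is unlinked or not on this queue,
 * in which case the queue is left untouched.
 */
static inline int remove_q_checked(struct cmd **qptr, struct cmd *c)
{
	if (c == NULL || c->next == NULL || c->prev == NULL)
		return -1;	/* never queued, or already removed */
	if (!on_q(*qptr, c))
		return -1;	/* linked, but on some other queue */

	if (c->next == c) {
		*qptr = NULL;	/* c was the only element */
	} else {
		if (*qptr == c)
			*qptr = c->next;
		c->prev->next = c->next;
		c->next->prev = c->prev;
	}
	c->next = c->prev = NULL;	/* mark as off-queue */
	return 0;
}
```

With a check like this, the second removeQ() call in the log above (next and prev both NULL) would be rejected by the cheap NULL test instead of dereferencing a NULL pointer. The list walk is the expensive part and is probably something to keep behind a debug option.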