On Wed Mar 21, 2012, you wrote:
> On Wed Mar 21, 2012, adam radford wrote:
> > On Wed, Mar 21, 2012 at 4:16 PM, Thomas Fjellstrom <thomas@xxxxxxxxxxxxx>
> > wrote:
> > > I recently got an IBM M1015 (MegaRaid 9240-8i) card, and after getting
> > > a new motherboard, the system now boots, but the megaraid_sas driver
> > > seems to be getting stuck when trying to initialize the card.
> > >
> > > Looking through the source, it seems to be stuck in the
> > > megasas_adp_reset_gen2 function, in the while loop at the end. Now,
> > > according to the code it can't actually get stuck there permanently,
> > > but it does take quite a while for the loop to finish, and the udev
> > > timeout messages to stop.
> > >
> > > I've looked around quite a bit, but haven't found any solutions thus
> > > far. If anyone could point me in the right direction I'd appreciate
> > > it.
> >
> > If you are getting controller resets during driver load, you must not
> > be getting interrupts or firmware is not responding to the inquiry
> > roll-call. Make sure you have the latest firmware.
>
> I updated to the latest on LSI's site today before emailing. It changes the
> behavior slightly. With the older firmware, it would not print any of the
> initial reset messages, but would once udev decides to start killing
> modprobe. With the new firmware, I get a:
>
>   ADP_RESET_GEN2: HostDiag=a0
>
> followed by a bunch of:
>
>   RESET_GEN2: retry=%x, hostdiag=a4
>
> Now I'm not sure the hostdiag should be different between the two. if this
> aN identifier is similar to the aN identifiers in the MegaCli tool, then
> it would mean its trying to reset a device that doesn't exist? I only have
> a single M1015 card installed.
>
> > The code at the end of megasas_adp_reset_gen2() just looks for
> > DIAG_RESET_ADAPTER flag to clear on the host diag register when
> > issuing a controller reset... that should happen almost immediately
> > unless there is a hardware or firmware issue.
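For anyone following along, the wait loop Adam describes can be sketched as a small user-space simulation. This is not the driver code: the register is a fake, the names fake_hostdiag, fake_readl, wait_reset_clear and reset_pending are hypothetical, and the two bit values are assumptions that should be checked against megaraid_sas.h in your tree. If those values are right, a0 and a4 differ only by 0x04, which would be the reset-in-progress flag rather than an adapter index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit values; verify against megaraid_sas.h before relying on them. */
#define DIAG_WRITE_ENABLE  0x00000080u
#define DIAG_RESET_ADAPTER 0x00000004u

/* Hypothetical stand-in for the host diag register. The reset bit clears
 * once `clears_after` reads have happened, or never if it is negative. */
struct fake_hostdiag {
    uint32_t value;
    int clears_after;
    int reads;
};

static uint32_t fake_readl(struct fake_hostdiag *r)
{
    if (r->clears_after >= 0 && r->reads >= r->clears_after)
        r->value &= ~DIAG_RESET_ADAPTER;
    r->reads++;
    return r->value;
}

/* Nonzero while the (assumed) reset bit is still set in a hostdiag value. */
static int reset_pending(uint32_t hostdiag)
{
    return (hostdiag & DIAG_RESET_ADAPTER) != 0;
}

/* Bounded poll: returns 0 once the reset bit clears, 1 if the retry limit
 * is hit first. *polls receives the number of polls made inside the loop. */
static int wait_reset_clear(struct fake_hostdiag *r, int max_retry, int *polls)
{
    int retry = 0;
    uint32_t hostdiag;

    assert(r != NULL && polls != NULL);
    hostdiag = fake_readl(r);

    while (reset_pending(hostdiag)) {
        /* A real driver would sleep between polls; omitted here. */
        hostdiag = fake_readl(r);
        if (retry++ >= max_retry) {
            *polls = retry;
            return 1;
        }
    }
    *polls = retry;
    return 0;
}
```

With a fake register that starts at 0xa4 and never clears, wait_reset_clear() only returns once the retry limit is exhausted. That matches the "can't get stuck permanently, but takes quite a while" behaviour described above: a bounded loop that is slow to give up, not an infinite one.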
> >
> > Are you sure your 'new' motherboard is actually good ?
>
> It boots and runs fine without the sas card installed. I haven't run any
> heavy load tests, but it seems ok.

Machine has been solid as a rock (sans 9240-8i) for the past month with mild
to half load. It runs several virtual machines, an NFS share, my firewall, a
minecraft server, and some other miscellaneous stuff. Not a single hiccup.

> > Can you move your megaraid 9240-8i into a 'known working' system and
> > re-test ?
>
> Nope. This is the furthest I've gotten it to get with this card installed.
> The old system would fail to boot into grub properly, let alone linux.
> These cards seem to be /very/ picky about what motherboard you install
> them in.
>
> > -Adam

I just got a second M1015 card in today and gave it a go. Similar issues,
different log messages.

(hand typed from picture taken of screens)

Lots of:

  megasas: Waiting for 1 commands to complete

for quite a while (5-10 minutes), along with udevd trying to kill modprobe.

Then:

  megasas: moving cmd[0]:hexstringherewithcolons queue as internal
  megaraid_sas: FW detected to be in fault state, restarting it...
  ADP_RESET_GEN2: HostDiag=a0
  megaraid_sas: FW restarted successfully,initializing next stage...
  megaraid_sas: HBA recovery state machine,state 1 starting...
  (sits here for a while)
  megasas: Waiting for FW to come to ready state
  megasas: FW now in ready state
  megaraid_sas: command hexstringhere, hexstringhere detected (something?) while HBA reset
  megasas: command hexstring scsi cmd [12]detected on the internal (something?) again
  megasas: reset successful
  scsi:0:0:0:0: megasas: RESET cmd=12 retries=0
  megaraid_sas: no pending cmds after reset
  megasas: reset successful
  scsi:0:0:0:0: megasas: RESET cmd=12 retries=0
  megaraid_sas: no pending cmds after reset
  megasas: reset successful
  scsi:0:0:0:0: Device offlined - not ready after error recovery
  (other scsi devices are detected)
  (bootup hangs here)

Eventually there are some "hung task" timeout backtraces. This is where I
tried to kill udevd: CTRL+C didn't stop it from trying to kill modprobe, and
ALT+SYSRQ+K caused a silent oops (keyboard LEDs blinking, no backtrace or
OOPS text).

If it's similar to last time, eventually the kernel will outright OOPS
without any intervention.

-- 
Thomas Fjellstrom
thomas@xxxxxxxxxxxxx