On Fri, 25 Apr 2008, Tore Anderson wrote:

> * Andrew Vasquez
>
> > There's a slew of problem reports noted on the web with this 'APIC
> > error' signature... From the qla2xxx driver perspective the following
> > logs show the classic 'no interrupts being routed' failures:
>
> Yes - I suspect this might not have anything to do with the HBA at all.
> It is a bit odd that it is always the qla2xxx driver that runs into
> trouble, for instance will I/O to the local hard drives continue to work
> (which is fortunate as that's where I have the kernel logs).
>
> > So do the abort requests fail with the similar signature (timeout)?
>
> The log isn't edited so if it doesn't say then I don't know.
>
> I/O service never recovers after the crash, so the multipath maps blocks
> all I/O until the machine is rebooted (which the remaining cluster
> members take care of within a minute).

Hmm, MSI is enabled:

	qla2xxx 0000:08:01.1: Found an ISP2422, irq 26, iobase 0xffffc20000c54000
	qla2xxx 0000:08:01.1: Configuring PCI space...
	qla2xxx 0000:08:01.1: Configure NVRAM parameters...
	qla2xxx 0000:08:01.1: Verifying loaded RISC code...
	scsi(2): **** Load RISC code ****
	scsi(2): Verifying Checksum of loaded RISC code.
	scsi(2): Checksum OK, start firmware.
	qla2xxx 0000:08:01.1: Allocated (64 KB) for EFT...
	qla2xxx 0000:08:01.1: Allocated (1413 KB) for firmware dump...
	scsi(2): Issue init firmware.
	qla2xxx 0000:08:01.1: MSI: Enabled.

... could you try disabling MSI via 'pci=nomsi' (I believe), we've dealt
with a large number of problem reports where customers reported 'odd'
behaviours (no interrupt routing) with several motherboard chipsets.
At least it could be another useful datapoint...

> > There's a blanket suggestion that has helped others (perhaps by
> > ignoring the problem), disable the APIC:
> >
> > 	apm=force noapic acpi=off pci=noacpi
> >
> > but that seems like a bandaid. I'd suggest you work this through your
> > IBM support contract, if possible.
>
> I will try to do both, thank you for the suggestions. I fear IBM will
> hang up on me for not running SuSE or Red Hat, though...
>
> > BTW: I'd like to take a look at several failure iterations, could you
> > send the messages file during the failures...
>
> Okay, sent you the (unedited) kern.log since the last log rotation. It
> contains several crash events, as well as the bootup messages (left them
> in there in case there's anything interesting for you to see).
>
> I have many more crash events in the rotated logs. If you want I can
> send you those too (maybe off list due to their size), just say so.
> They all look the same, though: APIC errors followed by qla2xxx
> attempting to fix it, but the rports never recover and in the end the
> machine is rebooted by another cluster node.

Let's force the driver to operate in INTx mode...

--
av
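
For the 'pci=nomsi' test suggested above, it may help to confirm after the reboot that the option was actually picked up and that the HBA really fell back to INTx. The following is only a rough sketch, not something from the qla2xxx driver: the helper names are made up for illustration, and it assumes the usual /proc/interrupts layout in which MSI vectors carry "MSI" in the interrupt-controller column (e.g. PCI-MSI-edge) while legacy interrupts show an IO-APIC or XT-PIC controller instead. The substring match is a heuristic and does not tell MSI and MSI-X apart.

```python
#!/usr/bin/env python3
"""Sketch: report whether pci=nomsi is set and how qla2xxx IRQs are routed."""


def qla2xxx_irq_modes(interrupts_text, driver="qla2xxx"):
    """Return (irq, mode) pairs for /proc/interrupts lines that name `driver`.

    mode is "MSI" if the line mentions MSI (MSI or MSI-X), else "INTx".
    This is a heuristic based on the controller column, not a driver query.
    """
    modes = []
    for line in interrupts_text.splitlines():
        if driver not in line:
            continue
        # The IRQ number is everything before the first colon.
        irq = line.split(":", 1)[0].strip()
        mode = "MSI" if "MSI" in line.upper() else "INTx"
        modes.append((irq, mode))
    return modes


def has_nomsi(cmdline_text):
    """True if the kernel command line carries pci=nomsi, alone or in a list."""
    for arg in cmdline_text.split():
        if arg.startswith("pci=") and "nomsi" in arg[4:].split(","):
            return True
    return False


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print("pci=nomsi on command line:", has_nomsi(f.read()))
    with open("/proc/interrupts") as f:
        for irq, mode in qla2xxx_irq_modes(f.read()):
            print("irq %s: %s" % (irq, mode))
```

If the HBA's IRQ still shows up as MSI after booting with the option, the parameter most likely did not reach the kernel, and the bootloader configuration would be the first thing to check.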