Hi Andrew,

* Andrew Vasquez

> There's a slew of problem reports noted on the web with this 'APIC
> error' signature... From the qla2xxx driver perspective the following
> logs show the classic 'no interrupts being routed' failures:

Yes - I suspect this might not have anything to do with the HBA at all.
It is a bit odd that it is always the qla2xxx driver that runs into
trouble; for instance, I/O to the local hard drives continues to work
(which is fortunate, as that's where I have the kernel logs).

> So do the abort requests fail with the similar signature (timeout)?

The log isn't edited, so if it doesn't say then I don't know. I/O
service never recovers after the crash, so the multipath maps block all
I/O until the machine is rebooted (which the remaining cluster members
take care of within a minute).

> There's a blanket suggestion that has helped others (perhaps by
> ignoring the problem), disable the APIC:
>
> apm=force noapic acpi=off pci=noacpi
>
> but that seems like a bandaid. I'd suggest you work this through your
> IBM support contract, if possible.

I will try to do both, thank you for the suggestions. I fear IBM will
hang up on me for not running SuSE or Red Hat, though...

> BTW: I'd like to take a look at several failure iterations, could you
> send the messages file during the failures...

Okay, I have sent you the (unedited) kern.log since the last log
rotation. It contains several crash events, as well as the bootup
messages (I left them in there in case there's anything interesting for
you to see).

I have many more crash events in the rotated logs. If you want, I can
send you those too (maybe off list due to their size), just say so. They
all look the same, though: APIC errors followed by qla2xxx attempting to
fix it, but the rports never recover and in the end the machine is
rebooted by another cluster node.

Regards,
-- 
Tore Anderson
Attachment:
kern.log.gz
Description: GNU Zip compressed data