On Tuesday, January 08, 2013 09:27:55 AM Yinghai Lu wrote:
> On Tue, Jan 8, 2013 at 8:50 AM, Thomas Renninger <trenn@xxxxxxx> wrote:
> > megaraid_sas
>
> can you check if your initrd for kdump kernel has that driver and
> module that it depends on like
> scsi sas transport etc ?

With the 5 patches removed, the disk works and the dump is written.
I can look a bit further at the memmap=exactmap issue tomorrow.
I can also double check the above then, but I am already rather sure
about it:
  - plain vanilla -> worked, dumping started
  - only these 5 patches added -> no disk

Some questions:

Are you trying to initialize the PCI subsystem in the kexec case the
way the BIOS typically has to do it?

Reacting to error conditions and handling them more gracefully at the
place where they are caught could be another approach, which imo makes
sense to implement in parallel.

In my case, for example, I see
"Present field in the IRTE entry is clear"
DMAR errors. I expect this comes from a device which still throws
interrupts, but whose irq vector did not get set up or registered in
the kexec'ed kernel. I could imagine this is the same error that
happens when an irq is wrongly configured and spurious interrupts
occur (but in the irq remapped case).

In my case it is not severe, as I only see this message once, but
according to another report, they see about 80 such DMAR error
messages per second. This seems to result in endless DMAR error
interrupts and finally a dead system.

I wonder whether the DMAR error handler could already invoke a PCIe
reset. I found:
  int pci_set_pcie_reset_state(struct pci_dev *dev,
                               enum pcie_reset_state state)
which unfortunately is only implemented for PPC. Would it make sense
to implement this one and trigger a function level reset if several
specific DMAR errors are seen (or if other PCI(e) error handlers
become active)?
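
To make the escalation idea concrete, here is a rough userspace sketch
(not kernel code, and not an existing interface). It only counts faults
per device in a time window and decides which step would come next. All
names, the threshold and the window length are assumptions made up for
illustration; how FAULT_RESET_DEV would map onto a real reset path is
left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical escalation steps; none of these names exist in the kernel. */
enum fault_action {
    FAULT_LOG,        /* below threshold: only report the fault */
    FAULT_RESET_DEV,  /* threshold reached: request a device reset */
    FAULT_MASK_IRQ    /* reset did not help: stop handling fault irqs */
};

struct fault_tracker {
    uint16_t source_id;     /* requester id of the faulting device */
    unsigned int count;     /* faults seen in the current window */
    uint64_t window_start;  /* start of the current window, in ms */
    bool reset_tried;       /* a reset was already requested once */
};

#define FAULT_THRESHOLD 10u    /* assumed: faults per window before acting */
#define FAULT_WINDOW_MS 1000u  /* assumed: window length in ms */

static void fault_tracker_init(struct fault_tracker *t, uint16_t source_id)
{
    assert(t != NULL);
    t->source_id = source_id;
    t->count = 0;
    t->window_start = 0;
    t->reset_tried = false;
}

/* Record one fault seen at now_ms and return the step to take. */
static enum fault_action fault_tracker_record(struct fault_tracker *t,
                                              uint64_t now_ms)
{
    assert(t != NULL);

    /* Start a fresh window once the old one has expired. */
    if (now_ms - t->window_start >= FAULT_WINDOW_MS) {
        t->window_start = now_ms;
        t->count = 0;
    }

    t->count++;
    if (t->count < FAULT_THRESHOLD)
        return FAULT_LOG;

    /* Threshold hit: reset first, mask only if a reset was already tried. */
    t->count = 0;
    if (!t->reset_tried) {
        t->reset_tried = true;
        return FAULT_RESET_DEV;
    }
    return FAULT_MASK_IRQ;
}
```

With these assumed values, a single fault is only logged, while a device
faulting about 80 times per second reaches the reset step on its tenth
fault and the masking step ten faults later.
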
If this does not help, the next step could be to stop DMAR error
interrupt handling or other iommu commands to keep the machine alive,
even if one device keeps firing interrupts to an unconfigured irq
vector (or whatever else could happen).

Just some ideas...
Comments appreciated.

    Thomas