Dave Lloyd <dave at davelloyd.com> writes:

> On Tue, May 14, 2013 at 5:33 PM, Eric W. Biederman
> <ebiederm at xmission.com> wrote:
>> Dave Lloyd <dave at davelloyd.com> writes:
>>
>>> On Tue, May 14, 2013 at 5:01 PM, Eric W. Biederman
>>> <ebiederm at xmission.com> wrote:
>>>
>>>>
>>>> Yes this does seem to be all over the place, and memory corruption
>>>> probably caused by ongoing-dma seems like a reasonable hypothesis.
>>>
>>> Thank goodness it's not just me! :-)
>>
>> It is a classic issue, although I suspect something is unique in your
>> setup because it has (to my knowledge) not been a widespread problem for
>> years.
>
> It could certainly be buggy hardware.  Other details include:
>
> Kernel 3.0.29.0 and we are also using infiniband (which I believe I
> found a reference to the Mellanox hardware potentially causing this
> issue unless the driver was unloaded before reboot with kexec).  The
> potential issue with unloading the IB drivers doesn't bug me nearly as
> much as not unloading pata_amd and pata_acpi causing the ACPI Error
> messages upon reboot with kexec.

Oh.  Yeah.  IB definitely sets up memory for ongoing dma.  So if it
doesn't have a shutdown method and IB traffic comes in during boot,
just about anything could happen.

> I'm inclined to chalk the ACPI Error messages up to potentially buggy
> BIOS/hardware from the vendor since pata_amd and pata_acpi are in wide
> use and I would expect to see more issues reported were there truly an
> issue with rebooting with kexec and not unloading pata_amd and
> pata_acpi.

Maybe.  Or it might be luck of timing: which memory happened to get
stomped on by incoming IB packets.

Eric