On Fri, 2015-04-10 at 16:42 +0800, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d,
> when a panic happens, the kdump kernel will boot with these faults:

But, in the general case, it *does* boot. There are two cases where it
doesn't actually boot, and those are the interesting ones.

Firstly, a device just keeps generating faults and we die in an
interrupt storm, reporting the same fault over and over again. That can
actually happen without kdump/kexec, and the correct fix for that is to
rate-limit fault reports, disable fault reporting for the offending
device after too many are seen, and then eventually tie it in to the
PCIe error handling as has been discussed elsewhere.

Secondly, there are devices which do not correctly respond to a hardware
reset. This is broken hardware, and if we really have to copy the old
contexts from the crashed kernel to work around it then I'd like it to
be on a blacklist basis: we do it only for hardware which is *known* to
be broken in this way.

(There are also some cases where the device driver doesn't even *try* to
reset the hardware and just assumes it'll find it in a sane state, as
the BIOS or a cleanly shut down kexec would have left it. In those cases
of course we can just fix the driver.)

I don't much like the idea of doing this context copy for *all*
hardware. That's masking hardware issues with reset that we really
*ought* to be finding.

I believe that most of the offending hardware is HP's; they like to do
the most, erm, "interesting" things with odd hardware and RMRRs and
stuff. So Zhen-Hua, would you be able to provide the list of broken
devices that HP has shipped, for the purpose of such a blacklist?

I assume you've already contacted the hardware folks responsible and
insisted that their devices are fixed to be resettable already, right?

--
dwmw2