Re: [PATCH v10 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Thu, 11 Jun 2015 16:40:12 +0100

On Fri, 2015-04-10 at 16:42 +0800, Li, Zhen-Hua wrote:
> This patchset is an update of Bill Sumner's patchset, implements a fix for:
> If a kernel boots with intel_iommu=on on a system that supports intel vt-d, 
> when a panic happens, the kdump kernel will boot with these faults:

But, in the general case, it *does* boot.

There are two cases where it doesn't actually boot, and those are the
interesting ones.

Firstly, a device just keeps generating faults and we die in an
interrupt storm, reporting the same fault over and over again. That can
actually happen without kdump/kexec and the correct fix for that is to
have rate-limiting, disable fault reporting for the offending device
after too many are seen, and then eventually to tie it in to the PCIe
error handling as has been discussed elsewhere.

Secondly, there are devices which do not correctly respond to a
hardware reset. This is broken hardware, and if we really have to copy
the old contexts from the crashed kernel to work around it then I'd
like it to be on a blacklist basis — we do it only for hardware which
is *known* to be broken in this way.

(There's also some cases where the device driver doesn't even *try* to
reset the hardware and just assumes it'll find it in a sane state as
the BIOS or a cleanly shut down kexec would have left it. In those
cases of course we can just fix the driver).

I don't much like the idea of doing this context copy for *all*
hardware. That's masking hardware issues with reset that we really
*ought* to be finding.

I believe that most of the offending hardware is HP's; they like to do
the most, erm, "interesting" things with odd hardware and RMRRs and
stuff. So Zhen-Hua would you be able to provide the list of broken
devices that HP has shipped, for the purpose of such a blacklist?

I assume you've already contacted the hardware folks responsible and
insisted that their devices are fixed to be resettable already, right?

-- 
dwmw2
Attachment:
smime.p7s

Description: S/MIME cryptographic signature