[PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO

joro@xxxxxxxxxx (Joerg Roedel) · Wed, 22 Oct 2014 15:21:58 +0200

Hi Bjorn,

On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
> I was looking at Zhen-Hua's recent patches, trying to figure out if I
> need to do anything with them.  Resetting devices in the old kernel
> seems like a non-starter.  Resetting devices in the new kernel, ...,
> well, maybe.  It seems ugly, and it seems like the sort of problem
> that IOMMUs are designed to solve.

Actually, resetting the devices in the kdump kernel would be one of the
better solutions for this problem. If we had a generic way to stop
all in-flight DMA from the PCI endpoints, we could safely disable and
then re-enable the IOMMU.

> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro at 8bytes.org> wrote:
> > That is a solution to prevent the in-flight DMA failures. But what
> > happens when there is some in-flight DMA to a disk to write some inodes
> > or a new superblock. Then this scratch address-space may cause
> > filesystem corruption at worst.
> 
> This in-flight DMA is from a device programmed by the old kernel, and
> it would be reading data from the old kernel's buffers.  I think
> you're suggesting that we might want that DMA read to complete so the
> device can update filesystem metadata?

Well, it is not about updating filesystem metadata. In the kdump kernel
we have these options:

	1) Disable the IOMMU. The problem here is that DMA is now
	   untranslated, so any in-flight DMA might read from or write
	   to a random location in memory, corrupting the kdump or
	   even the new kexec kernel memory. So this is a non-starter.

	2) Re-program the IOMMU to block all DMA. This is safer as it
	   does not corrupt any data in memory. But some devices react
	   very poorly to a master abort from the IOMMU, so poorly that
	   the driver in the kdump kernel fails to initialize the device.
	   In this case taking the dump itself might fail (and often
	   does, according to reports).

	3) To prevent master aborts like in option (2), David suggested
	   mapping the whole DMA address space to a scratch page. This
	   prevents both system memory corruption and the master aborts.
	   But the problem is that in-flight DMA will now read all
	   zeros. This can corrupt disk and network data; at worst it
	   nulls out the superblocks of your filesystem and you lose all
	   data. So this is not an option either.
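
To make the trade-offs between the three options concrete, here is a
minimal sketch of a toy model in plain C. It is not VT-d driver code:
every name in it (toy_policy, toy_machine, toy_dma_read and so on) is
made up for illustration, memory is a small byte array, and a DMA
address is simply used as an index into it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: every name here is hypothetical, none is a kernel API. */
enum toy_policy {
	TOY_IOMMU_DISABLED,	/* option 1: DMA bypasses translation */
	TOY_BLOCK_ALL,		/* option 2: every DMA is rejected */
	TOY_SCRATCH_PAGE,	/* option 3: every address hits one scratch page */
};

enum toy_status { TOY_DMA_OK, TOY_DMA_MASTER_ABORT };

#define TOY_PAGE_SIZE 16u
#define TOY_MEM_SIZE  64u

struct toy_machine {
	uint8_t mem[TOY_MEM_SIZE];	/* stands in for system memory */
	uint8_t scratch[TOY_PAGE_SIZE];	/* zero-filled scratch page */
};

static void toy_machine_init(struct toy_machine *m)
{
	memset(m->mem, 0xAA, sizeof(m->mem));	/* pretend this is live data */
	memset(m->scratch, 0, sizeof(m->scratch));
}

/* In-flight DMA write issued by a device the old kernel programmed. */
static enum toy_status toy_dma_write(struct toy_machine *m, enum toy_policy p,
				     size_t addr, uint8_t val)
{
	assert(addr < TOY_MEM_SIZE);
	switch (p) {
	case TOY_IOMMU_DISABLED:
		m->mem[addr] = val;		/* lands in memory unchecked */
		return TOY_DMA_OK;
	case TOY_BLOCK_ALL:
		return TOY_DMA_MASTER_ABORT;	/* memory safe, device may wedge */
	case TOY_SCRATCH_PAGE:
		m->scratch[addr % TOY_PAGE_SIZE] = val;	/* absorbed by scratch */
		return TOY_DMA_OK;
	}
	return TOY_DMA_MASTER_ABORT;
}

/* In-flight DMA read, e.g. a disk fetching the data it is about to write. */
static enum toy_status toy_dma_read(const struct toy_machine *m,
				    enum toy_policy p, size_t addr,
				    uint8_t *out)
{
	assert(addr < TOY_MEM_SIZE);
	switch (p) {
	case TOY_IOMMU_DISABLED:
		*out = m->mem[addr];
		return TOY_DMA_OK;
	case TOY_BLOCK_ALL:
		return TOY_DMA_MASTER_ABORT;
	case TOY_SCRATCH_PAGE:
		*out = m->scratch[addr % TOY_PAGE_SIZE]; /* not the real data */
		return TOY_DMA_OK;
	}
	return TOY_DMA_MASTER_ABORT;
}
```

In this model a write under TOY_IOMMU_DISABLED changes mem[], a write
under TOY_BLOCK_ALL leaves mem[] alone but reports a master abort, and
a read under TOY_SCRATCH_PAGE succeeds but returns the scratch page
contents instead of the data the device expected.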

What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.
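
The sequence described above can be sketched as a list of steps with
the fate of an in-flight DMA at each one. This is only an illustrative
model with made-up names (toy_step, toy_inflight_fate), and it assumes
that the reject-all tables are set up while the hardware is still
switched off, which is how I read the ordering above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, none is a kernel API. */
enum toy_step {
	TOY_STEP_INHERITED,	/* IOMMU still on with the old kernel's tables */
	TOY_STEP_HW_DISABLED,	/* translation switched off */
	TOY_STEP_REINIT,	/* reject-all tables set up (assumed: hw still off) */
	TOY_STEP_HW_REENABLED,	/* translation back on, rejecting all DMA */
	TOY_STEP_COUNT,
};

enum toy_fate {
	TOY_FATE_TRANSLATED,	/* goes through the old mappings */
	TOY_FATE_UNTRANSLATED,	/* hits raw physical memory: the race window */
	TOY_FATE_MASTER_ABORT,	/* rejected by the IOMMU */
};

/* What happens to a DMA that arrives while the driver is at 'step'. */
static enum toy_fate toy_inflight_fate(enum toy_step step)
{
	assert(step < TOY_STEP_COUNT);
	switch (step) {
	case TOY_STEP_INHERITED:
		return TOY_FATE_TRANSLATED;
	case TOY_STEP_HW_DISABLED:
	case TOY_STEP_REINIT:
		return TOY_FATE_UNTRANSLATED;
	default:
		return TOY_FATE_MASTER_ABORT;
	}
}

/* Number of steps during which in-flight DMA can corrupt memory. */
static size_t toy_race_window_steps(void)
{
	size_t n = 0;

	for (int s = 0; s < TOY_STEP_COUNT; s++)
		if (toy_inflight_fate((enum toy_step)s) == TOY_FATE_UNTRANSLATED)
			n++;
	return n;
}
```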

But this strategy fails quite often in heavy IO environments, and we are
looking into ways to solve the problem, or at least to improve the
current situation as well as we can.

I talked to David about this at LPC and we came up with basically this
procedure:

	1. If the VT-d driver finds the IOMMU enabled, it reuses its
	   root-context table. This way the IOMMU does not need to be
	   disabled and re-enabled, eliminating the race we have now.
	   Other data structures like the context-entries are copied
	   over from the old kernel.  We basically keep all mappings
	   from the old kernel, allowing any in-flight DMA to succeed.
	   No memory or disk data corruption.
	   The issue here is that we modify data from the old kernel
	   which is about to be dumped. But there are ways to work
	   around that.

	2. When a device driver issues the first dma_map command for a
	   device, we assign a new and empty page-table, thus removing
	   all mappings from the old kernel for the device.
	   The rationale is that at this point the device driver should
	   have reset the device to a point where all in-flight DMA is
	   canceled.
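
The two steps above can be sketched roughly as follows. This is a
simplified model, not a proposed implementation: it collapses the
root-context table and context-entries into one page-table pointer per
device, and all names (toy_pgtable, toy_device, toy_iommu_init,
toy_dma_map) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical, none is a kernel API. */
struct toy_pgtable {
	bool from_old_kernel;	/* inherited from the crashed kernel? */
	size_t nr_mappings;	/* number of live mappings in the table */
};

struct toy_device {
	struct toy_pgtable *pgtable;	/* table used to translate this device */
	struct toy_pgtable fresh;	/* storage for the kdump kernel's table */
	bool switched;			/* already moved to the fresh table? */
};

/* Point the device at a new, empty page-table. */
static void toy_switch_to_fresh(struct toy_device *dev)
{
	dev->fresh.from_old_kernel = false;
	dev->fresh.nr_mappings = 0;
	dev->pgtable = &dev->fresh;
	dev->switched = true;
}

/* Step 1: at init, keep the old mappings if the IOMMU was left enabled. */
static void toy_iommu_init(struct toy_device *dev, bool iommu_was_enabled,
			   struct toy_pgtable *old)
{
	if (iommu_was_enabled && old) {
		dev->pgtable = old;	/* in-flight DMA keeps translating */
		dev->switched = false;
	} else {
		toy_switch_to_fresh(dev);	/* normal boot path */
	}
}

/* Step 2: the first dma_map drops everything inherited from the old kernel. */
static void toy_dma_map(struct toy_device *dev)
{
	assert(dev->pgtable);
	if (!dev->switched)
		toy_switch_to_fresh(dev);
	dev->pgtable->nr_mappings++;
}

/* True while DMA programmed by the old kernel would still translate. */
static bool toy_old_mappings_live(const struct toy_device *dev)
{
	return dev->pgtable->from_old_kernel && dev->pgtable->nr_mappings > 0;
}
```

In this model the old table is only referenced, never written, after
the switch. The real driver would additionally have to deal with the
point from step 1 that reusing the old structures modifies data which
is about to be dumped.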

This approach goes in the same direction as Bill Sumner's patch-set,
which Li took over. But it does not go as far as keeping the old mappings
while the kdump kernel is still working with the devices (which might
introduce new issues and corner cases).

> > So with this in mind I would prefer initially taking over the
> > page-tables from the old kernel before the device drivers re-initialize
> > the devices.
> 
> This makes the dump kernel more dependent on data from the old kernel,
> which we obviously want to avoid when possible.

Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.

> I didn't find the previous discussion where pointing every virtual bus
> address at the same physical scratch page was proposed.  Why was that
> better than programming the IOMMU to reject every DMA?

As I said, the problem is that this causes master aborts which some
devices can't recover from, so that the device driver in the kdump
kernel fails to initialize the device.

	Joerg