Hi Bjorn, On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote: > I was looking at Zhen-Hua's recent patches, trying to figure out if I > need to do anything with them. Resetting devices in the old kernel > seems like a non-starter. Resetting devices in the new kernel, ..., > well, maybe. It seems ugly, and it seems like the sort of problem > that IOMMUs are designed to solve. Actually resetting the devices in the kdump kernel would be one of the better solutions for this problem. When we have a generic way to stop all in-flight DMA from the PCI endpoints we could safely disable and then re-enable the IOMMU. > On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro at 8bytes.org> wrote: > > That is a solution to prevent the in-flight DMA failures. But what > > happens when there is some in-flight DMA to a disk to write some inodes > > or a new superblock. Then this scratch address-space may cause > > filesystem corruption at worst. > > This in-flight DMA is from a device programmed by the old kernel, and > it would be reading data from the old kernel's buffers. I think > you're suggesting that we might want that DMA read to complete so the > device can update filesystem metadata? Well, it is not about updating filesystem metadata. In the kdump kernel we have these options: 1) Disable the IOMMU. Problem here is, that DMA is now untranslated, so that any in-flight DMA might read or write from a random location in memory, corrupting the kdump or even the new kexec kernel memory. So this is a non-starter. 2) Re-program the IOMMU to block all DMA. This is safer as it does not corrupt any data in memory. But some devices react very poorly on a master abort from the IOMMU, so bad that the driver in the kdump kernel fails to initialize that device. In this case taking dump itself might fail (and often does, according to reports) 3) To prevent master aborts like in option (2), David suggested to map the whole DMA address space to a scratch page. This also prevents system memory corruption and the master aborts. But the problem is, that in-flight DMA will now read all zeros. This can corrupt disk and network data, at worst it nulls out the superblocks of your filesystem and you lose all data. So this is not an option too. What we currently do in the VT-d driver is a mixture of (1) and (2). The VT-d driver disables the IOMMU hardware (opening a race window for memory data corruption), re-initializes it to reject any ongoing DMA (which causes master aborts for in-flight DMA) and re-enables the IOMMU hardware. But this strategy fails in heavy IO environments quite often and we look into ways to solve the problem, or at least improve the current situation as good as we can. I talked to David about this at LPC and we came up with basically this procedure: 1. If the VT-d driver finds the IOMMU enabled, it reuses its root-context table. This way the IOMMU must not be disabled and re-enabled, eliminating the race we have now. Other data structures like the context-entries are copied over from the old kernel. We basically keep all mappings from the old kernel, allowing any in-flight DMA to succeed. No memory or disk data corruption. The issue here is, that we modify data from the old kernel which is about to be dumped. But there are ways to work around that. 2. When a device driver issues the first dma_map command for a device, we assign a new and empty page-table, thus removing all mappings from the old kernel for the device. Rationale is, that at this point the device driver should have reset the device to a point where all in-flight DMA is canceled. This approach goes into the same direction as Bill Sumners patch-set, which Li took over. But it goes not as far as keeping the old mappings while the kdump kernel is still working with the devices (which might introduce new issues and corner cases). > > So with this in mind I would prefer initially taking over the > > page-tables from the old kernel before the device drivers re-initialize > > the devices. > > This makes the dump kernel more dependent on data from the old kernel, > which we obviously want to avoid when possible. Sure, but this is not really possible here (unless we have a generic and reliable way to reset all PCI endpoint devices and cancel all in-flight DMA before we disable the IOMMU in the kdump kernel). Otherwise we always risk data corruption somewhere, in system memory or on disk. > I didn't find the previous discussion where pointing every virtual bus > address at the same physical scratch page was proposed. Why was that > better than programming the IOMMU to reject every DMA? As I said, the problem is that this causes master aborts which some devices can't recover from, so that the device driver in the kdump kernel fails to initialize the device. Joerg