On Tue, 2008-09-23 at 02:29 +0000, Eric W. Biederman wrote:
> Bob Montgomery <bob.montgomery at hp.com> writes:
> > And that leads to the Kdump IO Rule:
> >
> > The primary kernel is responsible for setting up any necessary
> > conditions to allow the kdump kernel to perform its required
> > IO without detecting any iommu.
>
> Reserving a range or addresses in the iommu I agree with.
> If that range of addresses allows for identity mapping I
> like it better.
>
> I'm not certain about requiring it.
>
> I don't like setting up the identity mapping before hand,
> it allows devices to trash the kdump kernel by accident.

The reason for having the primary kernel set up any mapping needed by a
kdump kernel *in advance* is that for a HW IOMMU, this setup actually
consists of modifying data structures (arrays, trees, lists) that are in
the primary kernel's memory, as well as setting registers in the HW.

When the kdump kernel comes up, none of those structures are in its
memory range.  They're just part of the artifacts left in /dev/oldmem.
So yes, the kdump kernel could query any hardware that it found, verify
that the hardware had previously been in use, read HW registers to get
the root pointers, or list addresses or whatever, and then modify
arrays, trees, or lists in that non-owned memory to map its DMA, but
it's kind of an unprecedented step for the kdump kernel to take.
(Blindly copying oldmem pages is one thing, manipulating live data
structures in oldmem seems like quite another thing.)

Regarding the danger of trashing the kdump kernel prior to its launch:
Currently, any driver or errant kernel code can trash the kdump area.
And any IO card on a non-IOMMU or swiotlb system can trash it.  So it
doesn't seem like much of an extension of a risk that already exists.
It does however negate one possibility to lower some of that risk.

> > The kdump kernel must refrain from detecting and initializing
> > any iommu.
>
> Why?
> I can fully understand avoiding addresses that are in flight.
> I can definitely see this being simpler in the kdump kernel.
> However this feels like it makes a less robust kdump kernel by
> not allowing it to touch the iommu.

As pointed out above, "touching the iommu" really includes touching its
data structures created by the primary kernel in what is now the oldmem
area.  In addition, I'm not sure the kdump kernel can determine which
addresses are in flight by querying either the HW or the oldmem
structures.  It could probably determine which ones were unused at the
time of the crash.

> > This has a these effects:
> >
> > A) Primary kernel: depending on what it is using for as an IOMMU,
> > it may have to do some (or considerable) setup, to guarantee
> > that the kdump kernel can have IO capability to its Crash
> > kernel address range.
> >
> > B) Primary kernel: the Crash kernel range must be set up in an address
> > range whose physical addresses are accessible to IO cards
> > without address remapping.
>
> Below <= 16MB?  That doesn't work in general.

I didn't think this was working now.  Aren't most crash kernels
allocated above 16 MB?  And I assumed lots of systems don't have IOMMU
capability.  Do you have an example where this would be an issue for an
IO card needed by the kdump kernel?

> Especially not if we are running on an SGI box and someone had
> unplugged node 0 (with all of the memory below 4G).

How does an SGI box with an unplugged node 0 do kdump IO currently?

> > Possible?  Comments?  Corrections?
>
> Possible.
>
> I would very much like the option of doing the iommu setup, and possibly
> fiddling in the kdump kernel.  As long as we are not reusing the same
> addresses in the iommu I don't see a problem.

The problem I see is the oldmem area.
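
To make the ownership problem concrete, here is a minimal sketch in
plain C.  It is not kernel code: `struct phys_range`,
`range_contains()` and `kdump_owns_table()` are hypothetical names, and
the idea of an IOMMU "root pointer" read back from a register is only
the assumption described above.  It shows the check a kdump kernel
would effectively be making: does the table the hardware points at lie
inside the Crash kernel range, or out in oldmem?

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical physical address range; 'end' is exclusive. */
struct phys_range {
    uint64_t start;
    uint64_t end;
};

/* True if [addr, addr + len) lies entirely inside r (overflow-safe). */
bool range_contains(const struct phys_range *r, uint64_t addr, uint64_t len)
{
    assert(r->start <= r->end);
    uint64_t size = r->end - r->start;

    if (addr < r->start || len > size)
        return false;
    return addr - r->start <= size - len;
}

/*
 * Assumed scenario: 'root' is the table address the IOMMU hardware
 * reports.  If it is outside the crash kernel range, the table is
 * memory the kdump kernel does not own (it is part of oldmem).
 */
bool kdump_owns_table(const struct phys_range *crashk,
                      uint64_t root, uint64_t table_len)
{
    return range_contains(crashk, root, table_len);
}
```

With a made-up Crash kernel range of 512MB to 640MB and a made-up root
pointer of 0x7f400000, `kdump_owns_table()` returns false, which is the
situation described above: the only way to change the mapping is to
write into oldmem.
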
Now we could come up with a plan to allow the primary kernel to do all
of its iommu related allocations in the Crash kernel area, effectively
creating an area of memory that is shared between the primary kernel and
kdump.  (This would be complicated in cases where the iommu state is in
a dynamic tree vs. a fixed size array.)  Then the kdump kernel would
wake up and just take over maintenance of the iommu.

But even the much smaller proposal to preallocate entries in the iommu
data structures to allow the kdump kernel to do its IO is already
violating one of the principles of kdump.  It is making kdump operation
dependent on the integrity of a primary kernel data structure.
Actually taking over a shared iommu data structure from the primary
kernel seems like an even bigger philosophical violation.

> I like the theoretical option of disabling ongoing DMA's, with the
> more complete IOMMUs.  It isn't strictly necessary but I expect it
> would give a better result.

It seems like this implies:

 1) stopping the DMA at the IOMMU,
 2) surviving the resulting error condition when the IO card fails its
    next access (hopefully not MCE on a modern IOMMU),
 3) verifying that the IO card won't try another access later after
    you've started using the IOMMU in the kdump kernel, and then
 4) reinitializing and using the IOMMU.

Is it doable?

Thanks for considering,
Bob Montgomery