David, I received the following email from Bill Sumner addressing your
earlier email.

    Jerry

On Wed, 2014-04-30, David Woodhouse wrote:

Addressing a portion of the last question first:

> Was that option considered and discounted for some reason? It seems like
> it would make sense.

Considered? Yes. It is an interesting idea. See the technical discussion
below.

Discounted for some reason? Not really -- only for lack of time. With a
limited amount of time, I focused on providing a clean set of patches on
a recent baseline.

> On Thu, 2014-04-24 at 18:36 -0600, Bill Sumner wrote:
> >
> > This patch set modifies the behavior of the Intel iommu in the
> > crashdump kernel:
> > 1. to accept the iommu hardware in an active state,
> > 2. to leave the current translations in place so that legacy DMA will
> >    continue using its current buffers until the device drivers in the
> >    crashdump kernel initialize and initialize their devices,
> > 3. to use different portions of the iova address ranges for the device
> >    drivers in the crashdump kernel than the iova ranges that were in
> >    use at the time of the panic.
>
> There could be all kinds of existing mappings in the DMA page tables,
> and I'm not sure it's safe to preserve them.

Actually, I think it is safe to preserve them in a great number of cases,
and there is some percentage of cases where something else will work
better.

Fortunately, history shows that the panicked kernel generates bad
translations rarely enough that many crashdumps on systems without iommu
hardware still succeed simply by allowing the DMA/IO to continue into the
existing buffers. The patch set uses the same technique when the iommu is
active. So I think the odds start out in our favor -- enough in our favor
that we should merge the current patch set into Linux and then begin work
to improve it. Since the patch set is currently available, and its
technique has already been somewhat tested by three companies, the
remaining path for including it in Linux is short.
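As an aside, the iova-range split described in point 3 above can be
illustrated with a small user-space sketch. Everything here is made up
for illustration -- `kdump_iova_base`, the 4 GiB space, and the 2 MiB
alignment are assumptions, not the patch set's actual code; the only
point is that the crashdump kernel allocates from a region disjoint from
whatever the panicked kernel had mapped.

```c
/* Illustrative sketch only: pick a base for the crashdump kernel's
 * iova allocations that lies strictly above the highest iova found in
 * the panicked kernel's DMA page tables, so the two sets of mappings
 * never overlap. Names and constants are hypothetical. */
#include <stdint.h>

#define IOVA_SPACE_END 0x100000000ULL  /* assume a 4 GiB iova space */
#define IOVA_ALIGN     0x200000ULL     /* assume 2 MiB alignment    */

/* old_iova_max: highest iova mapped at panic time for this device.
 * Returns the base of a disjoint, aligned region for new allocations,
 * or 0 if the old mappings already fill the space. */
uint64_t kdump_iova_base(uint64_t old_iova_max)
{
    uint64_t base = (old_iova_max + IOVA_ALIGN) & ~(IOVA_ALIGN - 1);
    if (base >= IOVA_SPACE_END)
        return 0; /* no disjoint room left: caller must fall back */
    return base;
}
```

The real patch set works per-device against multi-level VT-d tables, but
the invariant is the same: new allocations never collide with iovas that
in-flight DMA may still be using.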
This would significantly improve the crashdump success rate when the
Intel iommu is active. It would also provide a foundation for
investigating more advanced techniques that would further increase the
success rate.

Bad translations do still happen -- bad programming, tables getting
hosed, etc. We would like to find a way to get a good dump in this extra
percentage of cases. It sounds like you have some good ideas in this
area.

The iommu hardware guarantees that DMA/IO can only access the memory
areas described in the DMA page tables, so we can be quite sure that the
only physical memory areas any DMA/IO can access are the ones described
by the contents of the translation tables. This comes with a few caveats:

1. hw-passthrough and the 'si' domain -- see the discussion under a later
   question. The problem with these is that they allow DMA/IO access to
   any of physical memory.

2. The iommu hardware will use valid translation entries to the "wrong"
   place just as readily as it will use ones to the "right" place.
   Having the kdump kernel check out the memory areas described by the
   tables before using them seems like a good idea. For instance, any
   DMA buffers from the panicked kernel that point into the kdump
   kernel's area would be highly suspect.

3. Random writes into the translation tables may well put known-bad
   values into fields or reserved areas -- values which will cause the
   iommu hardware to reject that entry. Unfortunately we cannot count on
   this happening, but it felt like a bright spot worth mentioning.

> What prevents the crashdump kernel from trying to use any of the
> physical pages which are accessible, and which could thus be corrupted
> by stray DMA?

DMA into the kdump area would corrupt the kdump and cause loss of the
dump.
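The kind of check suggested in caveat 2 might look roughly like the
following. This is a user-space sketch over a flattened, single-level
list of target physical addresses -- real VT-d tables are multi-level,
and `count_suspect_entries` and `struct region` are hypothetical names
invented here, not kernel APIs.

```c
/* Sketch of the sanity check discussed above: flag any page-table
 * entry whose target physical address falls inside the kdump kernel's
 * reserved region. Such entries would be highly suspect, since no
 * legitimate DMA buffer from the panicked kernel should live there. */
#include <stdint.h>
#include <stddef.h>

struct region {
    uint64_t base;
    uint64_t size;
};

/* phys_addrs: target physical addresses extracted from the old DMA
 * page tables; reserved: the kdump kernel's reserved physical region.
 * Returns how many entries point into the reserved region. */
size_t count_suspect_entries(const uint64_t *phys_addrs, size_t n,
                             struct region reserved)
{
    size_t suspect = 0;
    for (size_t i = 0; i < n; i++) {
        if (phys_addrs[i] >= reserved.base &&
            phys_addrs[i] < reserved.base + reserved.size)
            suspect++;
    }
    return suspect;
}
```

In a real implementation the walk would descend the context and
page-table hierarchy per device, and a nonzero count would presumably
mean clearing or redirecting those entries rather than just counting
them.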
Note that this was the original problem with disabling the iommu at the
beginning of the kdump kernel: it forced DMA that was going to its
original (good) buffers to begin going into essentially random places --
almost all of them "wrong".

However, I believe that the kdump kernel itself will not be the problem,
for the following reasons. As I understand the kdump architecture, the
kdump kernel is restricted to the physical memory area that was reserved
for it by the platform kernel during its initialization (when the
platform kernel presumably was still healthy). The kdump kernel is
assumed to be clean and healthy, so it will not attempt to use any
memory outside of what it is assigned -- except for reading pages of the
panicked kernel in order to write them to the dump file. Assuming that
the DMA page tables were checked to ensure that no DMA page table entry
points into the kdump kernel's reserved area, no stray DMA/IO will
affect the kdump kernel.

> In fact, the old kernel could even have set up 1:1 passthrough mappings
> for some devices, which would then be able to DMA *anywhere*. Surely we
> need to prevent that?

Yes, I agree. The 1:1 passthrough mappings seem to be problematic -- both
the use of hw-passthrough by the iommu and the 'si' domain set up in the
DMA page tables. These mappings completely bypass one of the basic
reasons for using the iommu hardware: restricting DMA access to
known-safe areas of memory. I would prefer that Linux not use either of
these mechanisms unless absolutely necessary -- in which case it could be
explicitly enabled. After all, there are probably still some (hopefully
few) devices that absolutely require it, and there may be circumstances
where a performance gain outweighs the additional risk to the crashdump.

If the kdump kernel finds a 1:1 passthrough domain among the DMA page
tables, the real issue comes if we also need that device for taking the
crashdump.
If we do not need it, then pointing all of that device's IOVAs at a safe
buffer -- as you recommend -- looks like a good solution. If kdump does
need it, I can think of two ways to handle things:

1. Just leave it. This is what happens when there is no hardware iommu
   active, and it has worked OK there for a long time. This option
   clearly depends upon the 1:1 passthrough device not being the
   problem. It is also what my patches do, since they are modeled on
   handling the DMA buffers in the same manner as when there is no
   iommu active.

2. As you suggest, create a safe buffer and force all of this device's
   IOVAs into it, then begin mapping real buffers when the kdump kernel
   starts using the device.

> After the last round of this patchset, we discussed a potential
> improvement where you point every virtual bus address at the *same*
> physical scratch page.
>
> That way, we allow the "rogue" DMA to continue to the same virtual bus
> addresses, but it can only ever affect one piece of physical memory and
> can't have detrimental effects elsewhere.

A few technical observations and questions that will hopefully help
implement this enhancement:

Since each device may eventually be used by the kdump kernel, each
device will need its own domain-id and its own set of DMA page tables,
so that the IOVAs requested by the kdump kernel can be mapped to that
device's buffers. As IO devices have grown smarter, many of them --
particularly NICs and storage interfaces -- use DMA for work queues and
status-reporting vectors in addition to buffers of data to be
transferred. Some experimenting and testing may be necessary to
determine how these devices behave when the translation for the work
queue is switched to a safe buffer which does not contain valid entries
for that device.

Questions that came to mind as I thought about this proposal:

1.
   Does the iommu need to know when the device driver has reset the
   device, so that it is safe to add translations to the DMA page
   tables?

2. If it needs to know, how does it know? The device driver asking for
   an IOVA via the DMA subsystem is usually the first indication to the
   iommu driver about the device, and this may not guarantee that the
   device driver has already reset the device at that point.

3. For any given device, which IOVAs will be mapped to the safe buffer?

   a. Only the IOVAs active at the time of the panic, which would
      require scanning the existing DMA page tables to find them?

   b. All possible IOVAs? This would seem to require a very large
      number of pages for the page tables -- especially since each
      device may need its own set of DMA page tables. There could still
      be only one "safe data buffer", with a lot of page table entries
      pointing to it.

   c. Determine these "on the fly" by capturing DMAR faults or some
      similar mechanism?

   d. Other possibilities?

> Was that option considered and discounted for some reason? It seems like
> it would make sense.

--
Bill Sumner

Forwarded by Jerry Hoemann

----------------------------------------------------------------------------
Jerry Hoemann                    Software Engineer    Hewlett-Packard
3404 E Harmony Rd. MS 57         phone: (970) 898-1022
Ft. Collins, CO 80528            FAX:   (970) 898-XXXX
                                 email: jerry.hoemann@xxxxxx
----------------------------------------------------------------------------