On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote:
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> >
> I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?

An unlikely failure mode. We're much more likely to see *wrong* IOMMU
than no IOMMU. And thus we'd still see the distinctive virtual addresses
just below 4GiB.

However, Rob's answer may solve that puzzle. If this is one of those
abominations where the device continues to do DMA to system memory even
after the OS is up and running and *thinks* it has control of the
hardware, then the offending address will be listed in an RMRR entry
(which tells the OS to set up a 1:1 mapping for access to certain memory
ranges for a given device). And will be inside an E820 reserved region.

A little odd that such an error would trigger only when we're actually
trying to initialise the device from the Linux driver, not as soon as we
enable the IOMMU. But all things are possible.

But the DMAR table and dmesg that I asked for would give us a bit more
information and hopefully let us stop speculating...

> > We should also rate-limit DMA faults, which would avoid the lockup
> > failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> > that a device is creating an endless stream of DMA faults and isn't
> > aborting the transaction?
>
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code. We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them. Maybe there should be, though.

Quite frankly, I don't care how *you* deal with them, or even if you
can. All I want to know is how I tell you about the problem, because *I*
sure as hell don't want to be trying to deal with it in the IOMMU code.
That's a generic PCI layer thing. :)

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@xxxxxxxxx                              Intel Corporation
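
For illustration only, the rate-limiting idea discussed above might be
sketched roughly as follows. This is a stand-alone user-space sketch, not
kernel code and not anything proposed in the thread: the names
fault_limiter, fault_limiter_init, fault_limiter_allow and
fault_limiter_storm are all hypothetical, as are the window and burst
parameters. It assumes a fixed time window in which at most `burst` fault
reports are let through, with the rest only counted, and it treats a large
suppressed count as a hint that the device is producing an endless stream
of faults.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device limiter (an assumption for illustration):
 * allow at most `burst` fault reports per `interval_ms` window and
 * merely count the remainder as suppressed. */
struct fault_limiter {
    uint64_t window_start_ms; /* start of the current window */
    uint64_t interval_ms;     /* length of one window */
    uint32_t burst;           /* reports allowed per window */
    uint32_t reported;        /* reports emitted in this window */
    uint32_t suppressed;      /* faults dropped in this window */
};

static void fault_limiter_init(struct fault_limiter *fl,
                               uint64_t interval_ms, uint32_t burst)
{
    fl->window_start_ms = 0;
    fl->interval_ms = interval_ms;
    fl->burst = burst;
    fl->reported = 0;
    fl->suppressed = 0;
}

/* Returns true if this fault should be reported, false if it should be
 * suppressed. `now_ms` is a monotonic timestamp supplied by the caller. */
static bool fault_limiter_allow(struct fault_limiter *fl, uint64_t now_ms)
{
    /* Start a fresh window once the current one has expired. */
    if (now_ms - fl->window_start_ms >= fl->interval_ms) {
        fl->window_start_ms = now_ms;
        fl->reported = 0;
        fl->suppressed = 0;
    }

    if (fl->reported < fl->burst) {
        fl->reported++;
        return true;
    }

    fl->suppressed++;
    return false;
}

/* Crude storm heuristic: true once `threshold` or more faults have been
 * suppressed in the current window. */
static bool fault_limiter_storm(const struct fault_limiter *fl,
                                uint32_t threshold)
{
    return fl->suppressed >= threshold;
}
```

What should happen once fault_limiter_storm() returns true is exactly the
question left open above: the sketch only detects the condition, and does
not say who gets told about it or what they do with the device.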