[PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

indou.takao@xxxxxxxxxxxxxx (Takao Indoh) · Thu, 13 Jun 2013 12:25:31 +0900

(2013/06/12 22:19), Don Dutile wrote:
> On 06/11/2013 07:19 PM, Sumner, William wrote:
>>
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>> On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh<indou.takao at jp.fujitsu.com>  wrote:
>>>>> (2013/06/07 13:14), Bjorn Helgaas wrote:
>>>>
>>>>>> One thing I'm not sure about is that you are only resetting PCIe
>>>>>> devices, but I don't think the problem is actually specific to PCIe,
>>>>>> is it?  I think the same issue could occur on any system with an
>>>>>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>>>>>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>>>>>> PA-RISC.
>>>>>
>>>>> Right, this is not specific to PCIe. The reasons why the target is only
>>>>> PCIe is just to make algorithm to reset simple. It is possible to reset
>>>>> legacy PCI devices in my patch, but code becomes somewhat complicated. I
>>>>> thought recently most systems used PCIe and there was little demand for
>>>>> resetting legacy PCI. Therefore I decided not to reset legacy PCI
>>>>> devices, but I'll do if there are requests :-)
>>>>
>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>> fs_initcall() that doesn't give the reader any clue about the
>>>> connection between the reset and the problem it's solving.
>>>>
>>>> If we do something like this patch, I think it needs to be done at the
>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>> to the important event, and there's a clue about how to make
>>>> corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
>>>> We already have a "reset_devices" boot option.  This is for the same
>>>> purpose, as far as I can tell, and I'm not sure there's value in
>>>> having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
>>>> fact, there's nothing specific even to PCI here.  The Intel VT-d docs
>>>> seem carefully written so they could apply to either PCIe or non-PCI
>>>> devices.
>>>
>>> Yeah, I can integrate this option into reset_devices. The reason why
>>> I separate them is to avoid regression.
>>>
>>> I have tested my patch on many machines and basically it worked, but I
>>> found two machines where this reset patch caused problem. The first one
>>> was caused by bug of raid card firmware. After updating firmware, this
>>> patch worked. The second one was due to bug of PCIe switch chip. I
>>> reported this bug to the vendor but it is not fixed yet.
>>>
>>> Anyway, this patch may cause problems on such a buggy machine, so I
>>> introduced new boot parameter so that user can enable or disable this
>>> function aside from reset_devices.
>>>
>>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>> reset_devices instead of adding new one, and using quirk or something to
>>> avoid such a buggy devices.
>>
>> With respect to "and using quirk or something to avoid such buggy devices",
>> I believe that it will be necessary to provide a mechanism for devices that
>> need special handling to do the reset -- perhaps something like a list
>> of tuples: (device_type, function_to_call) with a default function_to_call
>> when the device_type is not found in the list.  These functions would
>> need to be physically separate from the device driver because if the device
>> is present it needs to be reset even if the crash kernel chooses not to load
>> the driver for that device.
>>
>>>
>>> So, basically I agree with using reset_devices, but I want to prepare
>>> workaround in case this reset causes something wrong.
>>>
>> I like the ability to specify the original "reset_devices" separately from
>> invoking this new mechanism.  With so many different uses for Linux in
>> so many different environments and with so many different device drivers
>> it seems reasonable to keep the ability to tell the device drivers to
>> reset their devices -- instead of pulling the reset line on all devices.
>>
>> I also like the ability to invoke the new reset feature separately from
>> telling the device drivers to do it.
>>
>>>
>>>>
>>>>>> I tried to make a list of the interesting scenarios and the events
>>>>>> that are relevant to this problem:
>>>>>>
>>>>>>        Case 1: IOMMU off in system, off in kdump kernel
>>>>>>          system kernel leaves IOMMU off
>>>>>>            DMA targets system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>>            DMA targets system-kernel memory (harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets kdump-kernel memory
>>>>>>
>>>>>>        Case 2: IOMMU off in system kernel, on in kdump kernel
>>>>>>          system kernel leaves IOMMU off
>>>>>>            DMA targets system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU off, devices untouched)
>>>>>>            DMA targets system-kernel memory (harmless)
>>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>>>
>>>>>>        Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel doesn't know about IOMMU or doesn't touch it
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel re-inits device
>>>>>>            kernel assumes no IOMMU, so all new DMA mappings are invalid
>>>>>> because DMAs actually do go through the IOMMU
>>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>>
>>>>>>
>>>>>>        Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel disables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled
>>>>>>            (** corruption or other non-recoverable error likely **)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets kdump-kernel memory
>>>>>>
>>>>>>        Case 4: IOMMU on in system kernel, on in kdump kernel
>>>>>>          system kernel enables IOMMU
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kexec to kdump kernel (IOMMU on, devices untouched)
>>>>>>            DMA targets IOMMU-mapped system-kernel memory
>>>>>>          kdump kernel enables IOMMU with no valid mappings
>>>>>>            DMA causes IOMMU errors (annoying but harmless)
>>>>>>          kdump kernel re-inits device
>>>>>>            DMA targets IOMMU-mapped kdump-kernel memory
>>>>>
>>>>> This is not harmless. Errors like PCI SERR are detected here, and it
>>>>> makes driver or system unstable, and kdump fails. I also got report that
>>>>> system hangs up due to this.
>>>>
>>>> OK, let's take this slowly.  Does an IOMMU error in the system kernel
>>>> also cause SERR or make the system unstable?  Is that the expected
>>>> behavior on IOMMU errors, or is there something special about the
>>>> kdump scenario that causes SERRs?  I see lots of DMAR errors, e.g.,
>>>> those in https://bugzilla.redhat.com/show_bug.cgi?id=743790, that are
>>>> reported with printk and don't seem to cause an SERR.  Maybe the SERR
>>>> is system-specific behavior?
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=568153 is another (public)
>>>> report of IOMMU errors related to a driver bug where we just get
>>>> printks, not SERR.
>>>
>>> Yes, it depends on platform or devices. At least PCI SERR is detected on
>>> Fujitsu PRIMERGY BX920S2 and TX300S6.
>>>
>>> Intel VT-d documents[1] says:
>>>
>>>     3.5.1 Hardware Handling of Faulting DMA Requests
>>>     DMA requests that result in remapping faults must be blocked by
>>>     hardware. The exact method of DMA blocking is
>>>     implementation-specific.  For example:
>>>
>>>     - Faulting DMA write requests may be handled in much the same way as
>>>       hardware handles write requests to non-existent memory. For
>>>       example, the DMA request is discarded in a manner convenient for
>>>       implementations (such as by dropping the cycle, completing the
>>>       write request to memory with all byte enables masked off,
>>>       re-directed to a dummy memory location, etc.).
>>>
>>>     - Faulting DMA read requests may be handled in much the same way as
>>>       hardware handles read requests to non-existent memory. For
>>>       example, the request may be redirected to a dummy memory location,
>>>       returned as all 0<92>s or 1<92>s in a manner convenient to the
>>>       implementation, or the request may be completed with an explicit
>>>       error indication (preferred). For faulting DMA read requests from
>>>       PCI Express devices, hardware must indicate "Unsupported Request"
>>>       (UR) in the completion status field of the PCI Express read
>>>       completion.
>>>
>>> So, after IOMMU error, its behavior is implementation-specific.
>>>
>>> [1]
>>> Intel Virtualization Technology for Directed I/O Architecture
>>> Specification Rev1.3
>>>
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495 looks like a hang
>>>> when the kdump kernel reboots (after successfully saving a crashdump).
>>>>    But it is using "iommu=pt", which I don't believe makes sense.  The
>>>> scenario is basically case 4 above, but instead of the kdump kernel
>>>> starting with no valid IOMMU mappings, it identity-maps bus addresses
>>>> to physical memory addresses.  That's completely bogus because it's
>>>> certainly not what the system kernel did, so it's entirely likely to
>>>> make the system unstable or hang.  This is not an argument for doing a
>>>> reset; it's an argument for doing something smarter than "iommu=pt" in
>>>> the kdump kernel.
>>
>>>> We might still want to reset PCIe devices, but I want to make sure
>>>> that we're not papering over other issues when we do.  Therefore, I'd
>>>> like to understand why IOMMU errors seem harmless in some cases but
>>>> not in others.
>>>
>>> As I wrote above, IOMMU behavior on error is up to platform/devcies. I
>>> think it also depends on the value of Uncorrectable Error Mask Register
>>> in AER registers of each device.
>>>
>>>>> Case 1:  Harmless
>>>>> Case 2:  Not tested
>>>>> Case 3a: Not tested
>>>>> Case 3b: Cause problem, fixed by my patch
>>>>> Case 4:  Cause problem, fixed by my patch
>>>>>
>>>>> I have never tested case 2 and 3a, but I think it also causes problem.
>>>>
>>>> I do not believe we need to support case 3b (IOMMU on in system kernel
>>>> but disabled in kdump kernel).  There is no way to make that reliable
>>>> unless every single device that may perform DMA is reset, and since
>>>> you don't reset legacy PCI or VGA devices, you're not even proposing
>>>> to do that.
>>>>
>>>> I think we need to support case 1 (for systems with no IOMMU at all)
>>>> and case 4 (IOMMU enabled in both system kernel and kdump kernel).  If
>>>> the kdump kernel can detect whether the IOMMU is enabled, that should
>>>> be enough -- it could figure out automatically whether we're in case 1
>>>> or 4.
>>>
>>> Ok, I completely agree.
>>>
>>>
>>>>>> Do you have any bugzilla references or problem report URLs you could
>>>>>> include here?
>>>>>
>>>>> I know three Red Hat bugzilla, but I think these are private article and
>>>>> you cannot see. I'll add you Cc list in bz so that you can see.
>>>>>
>>>>> BZ#743790: Kdump fails with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743790
>>>>>
>>>>> BZ#743495: Hardware error is detected with intel_iommu=on
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=743495
>>>>>
>>>>> BZ#833299: megaraid_sas doesn't work well with intel_iommu=on and iommu=pt
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=833299
>>>>
>>>> Thanks for adding me to the CC lists.  I looked all three and I'm not
>>>> sure there's anything sensitive in them.  It'd be nice if they could
>>>> be made public if there's not.
>>>
>>> I really appreciate your comments! I'll confirm I can make it public.
>>
>> I would greatly appreciate being able to see the bugzillas relating to
>> this patch.
>>>
>>> Thanks,
>>> Takao Indoh
>>
>> Thinking out of the box:
>> Much of the discussion about dealing with the ongoing DMA leftover
>> from the system kernel has assumed that the crash kernel will reset
>> the IOMMU -- which causes various problems if done while there is any DMA
>> still active -- which leads to the idea of stopping all of the DMA.
>>
>> Suppose the crash kernel does not reset the hardware IOMMU, but simply
>> detects that it is active, resets only the devices that are necessary
>> for the crash kernel to operate, and re-programs only the translations
> 
> This suggestion brings up this one:
> Only reset the devices that are generating IOMMU faults, i.e., add a kdump kernel
> hook to the IOMMU fault handler.  While kdump kernel re-initing, IOMMU faults are grabbed
> by the kexec'd kernel and the device is reset. IOW, directed, kdump device reset.
> If no faults, then device is idle, or re-init'd and using new maps, and
> all is well.  Once kdump kernel fully init'd, then just return from
> kdump kernel callout, and let system do its fault handling.
> It can be made mostly common (reset code in kexec mod under drivers/iommu),
> with simple calls out from each IOMMU fault handler.

What I tried before is:
1) Prepare work queue handler
2) In IOMMU fault handler, wake up the work queue handler
3) In the work queue handler, reset devices against the error
   source.

As a result, this method did not work. The device are reset during its
initialization. Please take a look at the message below.  This is a
message when megaraid_sas driver is loaded in second kernel.

megasas: 00.00.05.40-rh2 Thu. Aug. 4 17:00:00 PDT 2011
megaraid_sas 0000:01:00.0: resetting MSI-X
megasas: 0x1000:0x0073:0x1734:0x1177: bus 1:slot 0:func 0
megaraid_sas 0000:01:00.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
megaraid_sas 0000:01:00.0: setting latency timer to 64
megasas: Waiting for FW to come to ready state
DRHD: handling fault status reg 702
DMAR:[DMA Write] Request device [01:00.0] fault addr ffff9000
DMAR:[fault reason 05] PTE Write access is not set
megasas: FW now in Ready state
  alloc irq_desc for 51 on node -1
  alloc kstat_irqs on node -1
megaraid_sas 0000:01:00.0: irq 51 for MSI/MSI-X
megasas_init_mfi: fw_support_ieee=67108864

Note that DMAR error is detected during driver initialization. So when I
added reset method which I mentioned above, the device is reset here,
and after that, megasas could not find its disks and kdump failed.

Thanks,
Takao Indoh

> 
> Of course, this assumes that systems don't turn IOMMU faults into system SERRs,
> and crash the system.  IMO, as I've stated to a number of system developers,
> IOMMU (mapping) faults should not crash the system -- they already isolate, and
> prevent corruption, so worse case, some part of the system will stop doing I/O,
> and that will either get retried, or aborted and a cleaner panic (and kdump
> kernel switch) will ensue.
> 
>> for those devices.  All other translations remain the same (and remain valid)
>> so all leftover DMA continues into its buffer in the system kernel area
>> where it is harmless.  New translations needed by the kdump kernel are
>> added to the existing tables.
>>
> A crashed system shouldn't assume that ongoing DMA is sane.... it should be stopped.
> I would expect the kdump kernel to reset devices & acquire new dma mappings
> upon reboot.   Thus, the only issue is how to 'throttle' IOMMU faults, and
> not allow them to crash systems while the system is recovering/resetting itself,
> but it's not one big (power) reset to cause it.
> 
>> I have not yet tried this, so I am not ready to propose it as anything more
>> than a discussion topic at this time.
>>
>> It might work this way: (A small modification to case 3a above)
>>
>> IOMMU on in system kernel, kdump kernel accepts active IOMMU
>>     system kernel enables IOMMU
>>        DMA targets IOMMU-mapped system-kernel memory
>>     kexec to kdump kernel (IOMMU on, devices untouched)
>>        DMA targets IOMMU-mapped system-kernel memory
> it may not, it may be bad/bogus device I/O causing the crash...
> 
>>     kdump kernel detects active IOMMU and doesn't touch it
>>        DMA targets IOMMU-mapped system-kernel memory
>>     kdump kernel does not re-initialize IOMMU hardware
>>     kdump kernel initializes IOMMU in-memory management structures
>>     kdump kernel calls device drivers' standard initialization functions
>>        Drivers initialize their own devices -- DMA from that device stops
>>        When drivers request new DMA mappings, the kdump IOMMU driver:
>>        1. Updates its in-memory mgt structures for that device&  range
>>        2. Updates IOMMU translate tables for that device&  range
>>           . Translations for all other devices&  ranges are unchanged
>>        3. Flushes IOMMU TLB to force IOMMU hardware update
>>
>>     Notes:
>>        A. This seems very similar to case 1
>>
>>        B. Only touch devices with drivers loaded into kdump kernel.
>>           . No need to touch devices that kdump is not going to use.
>>
>>        C. This assumes that all DMA device drivers used by kdump will
>>           initialize the device before requesting DMA mappings.
>>           . Seems reasonable, but need to confirm how many do (or don't)
>>           . Only device drivers that might be included in the kdump
>>             kernel need to observe this initialization ordering.
>>
>>        D. Could copy system kernel's translate tables into kdump kernel
>>           and adjust pointers if this feels more trustworthy than using
>>           the original structures where they are in the system kernel.
>>
>>        E. Handle IRQ remappings in a similar manner.
> An even more serious attack vector: flood system w/interrupts (by assigned device in a guest),
> effectively creating a DoS, b/c intrs caused raised (hw) ipl, while bogus DMA does
> not cause system ipl elevation(blockage of other system progress), except when it
> generates IOMMU faults which become intrs.
> hmmm, wondering if another solution is to flip IOMMU fault handling into a poll-mode
> at kdump init time, and then flip it to intr-driven, once I/O has been completely init'd ?
> Per-device fault throttling would be a nice hw feature! (wishful thinking)
> 
> 
>>
>> Thanks,
>> Bill Sumner
>>
>>
> 
> 
>