Sorry it's taken me so long to look at this. I've been putting this off because the patch doesn't seem "obviously correct," and I don't feel like I really understand the problem. I'm naive about both IOMMUs and kexec/kdump, so please pardon (and help me with) any silly questions or assumptions below. On Mon, May 13, 2013 at 11:29 PM, Takao Indoh <indou.takao at jp.fujitsu.com> wrote: > This patch resets PCIe devices on boot to stop ongoing DMA. When > "pci=pcie_reset_endpoint_devices" is specified, a hot reset is triggered > on each PCIe root port and downstream port to reset its downstream > endpoint. > > Problem: > This patch solves the problem that kdump can fail when intel_iommu=on is > specified. When intel_iommu=on is specified, many dma-remapping errors > occur in second kernel and it causes problems like driver error or PCI > SERR, at last kdump fails. This problem is caused as follows. > 1) Devices are working on first kernel. > 2) Switch to second kernel(kdump kernel). The devices are still working > and its DMA continues during this switch. > 3) iommu is initialized during second kernel boot and ongoing DMA causes > dma-remapping errors. If I understand correctly, the problem only happens on systems with an IOMMU that's enabled in either the system or kdump kernel (or both). For systems without an IOMMU (or if it is disabled in both the system and kdump kernels), any ongoing DMA should use addresses that target system-kernel memory and should not affect the kdump kernel. One thing I'm not sure about is that you are only resetting PCIe devices, but I don't think the problem is actually specific to PCIe, is it? I think the same issue could occur on any system with an IOMMU. In the x86 PC world, most IOMMUs are connected with PCIe, but there are systems with IOMMUs for plain old PCI devices, e.g., PA-RISC. I tried to make a list of the interesting scenarios and the events that are relevant to this problem: Case 1: IOMMU off in system, off in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel re-inits device DMA targets kdump-kernel memory Case 2: IOMMU off in system kernel, on in kdump kernel system kernel leaves IOMMU off DMA targets system-kernel memory kexec to kdump kernel (IOMMU off, devices untouched) DMA targets system-kernel memory (harmless) kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory Case 3a: IOMMU on in system kernel, kdump kernel doesn't touch IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel doesn't know about IOMMU or doesn't touch it DMA targets IOMMU-mapped system-kernel memory kdump kernel re-inits device kernel assumes no IOMMU, so all new DMA mappings are invalid because DMAs actually do go through the IOMMU (** corruption or other non-recoverable error likely **) Case 3b: IOMMU on in system kernel, kdump kernel disables IOMMU system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel disables IOMMU DMA targets IOMMU-mapped system-kernel memory, but IOMMU is disabled (** corruption or other non-recoverable error likely **) kdump kernel re-inits device DMA targets kdump-kernel memory Case 4: IOMMU on in system kernel, on in kdump kernel system kernel enables IOMMU DMA targets IOMMU-mapped system-kernel memory kexec to kdump kernel (IOMMU on, devices untouched) DMA targets IOMMU-mapped system-kernel memory kdump kernel enables IOMMU with no valid mappings DMA causes IOMMU errors (annoying but harmless) kdump kernel re-inits device DMA targets IOMMU-mapped kdump-kernel memory The problem cases I see are 3a and 3b, but that's not the problem you're describing. Obviously I'm missing something. It sounds like you're seeing problems in case 2 or case 4, where the IOMMU is enabled in the kdump kernel. Maybe my assumption about the IOMMU being enabled with no valid mappings is wrong? Or maybe those IOMMU errors are not actually harmless? Resetting PCI/PCIe devices will help with cases 2, 4, and 3b, but not with case 3a. Therefore, it seems like the kdump kernel *must* contain IOMMU support unless it knows for certain that the system kernel wasn't using the IOMMU. Do you have any bugzilla references or problem report URLs you could include here? Obviously I'm very confused here, so please help educate me :) Also, you mentioned PCI passthrough with a KVM guest as being a problem. Can you explain more about what makes that a problem? I don't know enough about passthrough to know why a device being used by a KVM guest would cause more problems than a device being used directly by the host. Bjorn > Solution: > All DMA transactions have to be stopped before iommu is initialized. By > this patch devices are reset and in-flight DMA is stopped before > pci_iommu_init. > > To invoke hot reset on an endpoint, its upstream link need to be reset. > reset_pcie_endpoints() is called from fs_initcall_sync, and it finds > root port/downstream port whose child is PCIe endpoint, and then reset > link between them. If the endpoint is VGA device, it is skipped because > the monitor blacks out if VGA controller is reset. > > Signed-off-by: Takao Indoh <indou.takao at jp.fujitsu.com> > --- > Documentation/kernel-parameters.txt | 2 + > drivers/pci/pci.c | 113 +++++++++++++++++++++++++++++++++++ > 2 files changed, 115 insertions(+), 0 deletions(-) > > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt > index c3bfacb..8c9e8e4 100644 > --- a/Documentation/kernel-parameters.txt > +++ b/Documentation/kernel-parameters.txt > @@ -2321,6 +2321,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. > any pair of devices, possibly at the cost of > reduced performance. This also guarantees > that hot-added devices will work. > + pcie_reset_endpoint_devices Reset PCIe endpoint on boot by > + hot reset > cbiosize=nn[KMG] The fixed amount of bus space which is > reserved for the CardBus bridge's IO window. > The default value is 256 bytes. > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index a899d8b..70c1205 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -3873,6 +3873,116 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) > } > EXPORT_SYMBOL(pci_fixup_cardbus); > > +/* > + * Return true if dev is PCIe root port or downstream port whose child is PCIe > + * endpoint except VGA device. > + */ > +static int __init need_reset(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + if (!pci_is_pcie(dev) || !dev->subordinate || > + list_empty(&dev->subordinate->devices) || > + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && > + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) > + return 0; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || > + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || > + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) > + /* Don't reset switch, bridge, VGA device */ > + return 0; > + } > + > + return 1; > +} > + > +static void __init save_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "save state\n"); > + pci_save_state(child); > + } > +} > + > +static void __init restore_downstream_configs(struct pci_dev *dev) > +{ > + struct pci_bus *subordinate; > + struct pci_dev *child; > + > + subordinate = dev->subordinate; > + list_for_each_entry(child, &subordinate->devices, bus_list) { > + dev_info(&child->dev, "restore state\n"); > + pci_restore_state(child); > + } > +} > + > +static void __init do_downstream_device_reset(struct pci_dev *dev) > +{ > + u16 ctrl; > + > + dev_info(&dev->dev, "Reset Secondary bus\n"); > + > + /* Assert Secondary Bus Reset */ > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl); > + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; > + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); > + > + /* Read config again to flush previous write */ > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &ctrl); > + > + msleep(2); > + > + /* De-assert Secondary Bus Reset */ > + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET; > + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); > +} > + > +struct pci_dev_entry { > + struct list_head list; > + struct pci_dev *dev; > +}; > +static int __initdata pcie_reset_endpoint_devices; > +static int __init reset_pcie_endpoints(void) > +{ > + struct pci_dev *dev = NULL; > + struct pci_dev_entry *pdev_entry, *tmp; > + LIST_HEAD(pdev_list); > + > + if (!pcie_reset_endpoint_devices) > + return 0; > + > + for_each_pci_dev(dev) > + if (need_reset(dev)) { > + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL); > + pdev_entry->dev = dev; > + list_add(&pdev_entry->list, &pdev_list); > + } > + > + list_for_each_entry(pdev_entry, &pdev_list, list) > + save_downstream_configs(pdev_entry->dev); > + > + list_for_each_entry(pdev_entry, &pdev_list, list) > + do_downstream_device_reset(pdev_entry->dev); > + > + msleep(1000); > + > + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) { > + restore_downstream_configs(pdev_entry->dev); > + kfree(pdev_entry); > + } > + > + return 0; > +} > +fs_initcall_sync(reset_pcie_endpoints); > + > static int __init pci_setup(char *str) > { > while (str) { > @@ -3915,6 +4025,9 @@ static int __init pci_setup(char *str) > pcie_bus_config = PCIE_BUS_PEER2PEER; > } else if (!strncmp(str, "pcie_scan_all", 13)) { > pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); > + } else if (!strncmp(str, "pcie_reset_endpoint_devices", > + 27)) { > + pcie_reset_endpoint_devices = 1; > } else { > printk(KERN_ERR "PCI: Unknown option `%s'\n", > str); > -- > 1.7.1 > >