On 10/20/2014 07:20 AM, Gavin Shan wrote: > On Fri, Oct 17, 2014 at 02:44:43AM -0700, Eric W. Biederman wrote: >> "Li, Zhen-Hua" <zhen-hual at hp.com> writes: >> >>> This is an update of the patch >>> https://lkml.org/lkml/2014/10/10/37 >>> >>> This patch is doing the reset works before the kdump kernel boots. >> >> If I have said it once I have said it a thousand times. >> >> Nacked-by: "Eric W. Biederman" <ebiederm at xmission.com> >> Nacked-by: "Eric W. Biederman" <ebiederm at xmission.com> >> Nacked-by: "Eric W. Biederman" <ebiederm at xmission.com> >> >> It is absolutely inappropriate to do this in the kernel that has >> crashed. >> >> Either reserve a chunk of the iommu for use by the crash dump kernel, >> or figure out how to do this during boot up. >> >> But this is absolutely and totally inappropriate to do in crash_kexec. >> The failure modes are all wrong. It will impact the reliability of our >> crash dumps. >> >> If an the code for a linux architecture is not structured in such a way >> as to make it easy or straight forward to do this my sympathies you are >> going to have do fix that linux port to be able to do things during >> boot. >> >> But things called from crash_kexec should be absolute necessities and I >> do not see this as coming anywhere close to being something that is >> impossible to do in the target kernel. >> > > Zhenhua, Intrestingly, PPC has mechanism to reset the root port for kdump case > in order to drop pending DMA traffic. Maybe you can implement similar thing. > However, this kind of reset would be platform/board specific and needs support > from underly firmware. More details could be found from arch/powerpc/ > platforms/powernv/pci-ioda.c > > if (is_kdump_kernel()) { > pr_info(" Issue PHB reset ...\n"); > ioda_eeh_phb_reset(hose, EEH_RESET_FUNDAMENTAL); > ioda_eeh_phb_reset(hose, OPAL_DEASSERT_RESET); > } > > Please check if it's the thing you're looking for. I didn't understand your > requirements 100% > > Thanks, > Gavin Gavin, I was trying to fix a DMAR fault for iommu. On X86 platform, iommu is initialized after pci , while on powerpc iommu is before pci. So I tried to do the operation in the first kernel, and this can work on all platforms. But according to Eric's mail, it is inappropriate doing this in crash_kexec. So I have to figure out how to do this in the kdump kernel. As it is clear the powerpc has already do the resetting during booting, so I only need to implement a similar mechanism on X86 system. Thanks Zhenhua > >> Eric >> >> >>> On a Linux system with iommu supported and many PCI devices on it, >>> when kernel crashed and the kdump kernel boots with intel_iommu=on, >>> there may be some unexpected DMA requests on this adapter, which will >>> cause DMA Remapping faults like: >>> dmar: DRHD: handling fault status reg 102 >>> dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000 >>> DMAR:[fault reason 01] Present bit in root entry is clear >>> >>> This bug may happen on *any* PCI device. >>> Analysis for this bug: >>> >>> The present bit is set in this function: >>> >>> static struct context_entry * device_to_context_entry( >>> struct intel_iommu *iommu, u8 bus, u8 devfn) >>> { >>> ...... >>> set_root_present(root); >>> ...... >>> } >>> >>> Calling tree: >>> device driver >>> intel_alloc_coherent >>> __intel_map_single >>> domain_context_mapping >>> domain_context_mapping_one >>> device_to_context_entry >>> >>> This means, the present bit in root entry will not be set until the device >>> driver is loaded. >>> >>> But in the kdump kernel, hardware devices are not aware that control has >>> transferred to the second kernel, and those drivers must initialize again. >>> Consequently there may be unexpected DMA requests from devices activity >>> initiated in the first kernel leading to the DMA Remapping errors in the >>> second kernel. >>> >>> To fix this DMAR fault, we need to reset the bus that this device on. Reset >>> the device itself does not work. >>> >>> A patch for this bug that has been sent before: >>> https://lkml.org/lkml/2014/9/30/55 >>> As in discussion, this bug may happen on *any* device, so we need to reset >>> all pci devices. >>> >>> There was an original version(Takao Indoh) that resets the pcie devices: >>> https://lkml.org/lkml/2013/5/14/9 >>> >>> According to the previous discussion, On sparc, the IOMMU is initialized >>> before PCI devices are enumerated, this patch does the resetting works before >>> the kdump kernel boots, so it can also fix the problems on sparc. >>> >>> Update of this new version, comparing with Takao Indoh's version: >>> Add support for legacy PCI devices. >>> Use pci_try_reset_bus instead of do_downstream_device_reset. >>> Reset all PCI/PCIe deviecs in the first kernel, before kdump kernel boots. >>> >>> Randy Wright corrects some misunderstanding in this description. >>> >>> Signed-off-by: Li, Zhen-Hua <zhen-hual at hp.com> >>> Signed-off-by: Takao Indoh <indou.takao at jp.fujitsu.com> >>> Signed-off-by: Randy Wright <rwright at hp.com> >>> --- >>> drivers/pci/pci.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++ >>> include/linux/pci.h | 6 ++++ >>> kernel/kexec.c | 2 ++ >>> 3 files changed, 90 insertions(+) >>> >>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>> index 625a4ac..aa9192a 100644 >>> --- a/drivers/pci/pci.c >>> +++ b/drivers/pci/pci.c >>> @@ -23,6 +23,7 @@ >>> #include <linux/device.h> >>> #include <linux/pm_runtime.h> >>> #include <linux/pci_hotplug.h> >>> +#include <linux/crash_dump.h> >>> #include <asm-generic/pci-bridge.h> >>> #include <asm/setup.h> >>> #include "pci.h" >>> @@ -4466,6 +4467,87 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >>> } >>> EXPORT_SYMBOL(pci_fixup_cardbus); >>> >>> +/* >>> + * Return true if dev is PCI root port or downstream port whose child is PCI >>> + * endpoint except VGA device. >>> + */ >>> +static int __pci_dev_need_reset(struct pci_dev *dev) >>> +{ >>> + struct pci_bus *subordinate; >>> + struct pci_dev *child; >>> + >>> + if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) >>> + return 0; >>> + >>> + if (pci_is_pcie(dev)) { >>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) && >>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)) >>> + return 0; >>> + } >>> + >>> + subordinate = dev->subordinate; >>> + list_for_each_entry(child, &subordinate->devices, bus_list) { >>> + /* Don't reset switch, bridge, VGA device */ >>> + if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) || >>> + ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) || >>> + ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY)) >>> + return 0; >>> + >>> + if (pci_is_pcie(child)) { >>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE)) >>> + return 0; >>> + } >>> + } >>> + >>> + return 1; >>> +} >>> + >>> +struct pci_dev_reset_entry { >>> + struct list_head list; >>> + struct pci_dev *dev; >>> +}; >>> +int pci_reset_endpoints(void) >>> +{ >>> + struct pci_dev *dev = NULL; >>> + struct pci_dev_reset_entry *pdev_entry, *tmp; >>> + struct pci_bus *subordinate = NULL; >>> + int has_it; >>> + >>> + LIST_HEAD(pdev_list); >>> + >>> + >>> + for_each_pci_dev(dev) { >>> + subordinate = dev->subordinate; >>> + if (!subordinate || list_empty(&subordinate->devices)) >>> + continue; >>> + >>> + has_it = 0; >>> + list_for_each_entry(pdev_entry, &pdev_list, list) { >>> + if (dev == pdev_entry->dev) { >>> + has_it = 1; >>> + break; >>> + } >>> + } >>> + if (has_it) >>> + continue; >>> + >>> + if (__pci_dev_need_reset(dev)) { >>> + pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL); >>> + pdev_entry->dev = dev; >>> + list_add(&pdev_entry->list, &pdev_list); >>> + } >>> + } >>> + >>> + list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) { >>> + pci_try_reset_bus(pdev_entry->dev->subordinate); >>> + kfree(pdev_entry); >>> + } >>> + >>> + return 0; >>> +} >>> +EXPORT_SYMBOL_GPL(pci_reset_endpoints); >>> + >>> static int __init pci_setup(char *str) >>> { >>> while (str) { >>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>> index 5be8db4..1cf7207 100644 >>> --- a/include/linux/pci.h >>> +++ b/include/linux/pci.h >>> @@ -1869,4 +1869,10 @@ static inline bool pci_is_dev_assigned(struct pci_dev *pdev) >>> { >>> return (pdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED) == PCI_DEV_FLAGS_ASSIGNED; >>> } >>> + >>> +/* >>> + * Reset all pci devices by resetting the buses. >>> + */ >>> +int pci_reset_endpoints(void); >>> + >>> #endif /* LINUX_PCI_H */ >>> diff --git a/kernel/kexec.c b/kernel/kexec.c >>> index 2abf9f6..986e8f7 100644 >>> --- a/kernel/kexec.c >>> +++ b/kernel/kexec.c >>> @@ -36,6 +36,7 @@ >>> #include <linux/syscore_ops.h> >>> #include <linux/compiler.h> >>> #include <linux/hugetlb.h> >>> +#include <linux/pci.h> >>> >>> #include <asm/page.h> >>> #include <asm/uaccess.h> >>> @@ -1474,6 +1475,7 @@ void crash_kexec(struct pt_regs *regs) >>> if (kexec_crash_image) { >>> struct pt_regs fixed_regs; >>> >>> + pci_reset_endpoints(); >>> crash_setup_regs(&fixed_regs, regs); >>> crash_save_vmcoreinfo(); >>> machine_crash_shutdown(&fixed_regs); >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-pci" in >> the body of a message to majordomo at vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >