On 04/25/2013 11:10 PM, Takao Indoh wrote: > (2013/04/26 3:01), Don Dutile wrote: >> On 04/25/2013 01:11 AM, Takao Indoh wrote: >>> (2013/04/25 4:59), Don Dutile wrote: >>>> On 04/24/2013 12:58 AM, Takao Indoh wrote: >>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>>> PCIe root port and downstream port to reset its downstream endpoint. >>>>> >>>>> Problem: >>>>> This patch solves the problem that kdump can fail when intel_iommu=on is >>>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>>> occur in second kernel and it causes problems like driver error or PCI >>>>> SERR, at last kdump fails. This problem is caused as follows. >>>>> 1) Devices are working on first kernel. >>>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>>> and its DMA continues during this switch. >>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes >>>>> dma-remapping errors. >>>>> >>>>> Solution: >>>>> All DMA transactions have to be stopped before iommu is initialized. By >>>>> this patch devices are reset and in-flight DMA is stopped before >>>>> pci_iommu_init. >>>>> >>>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>>> between them. If the endpoint is VGA device, it is skipped because the >>>>> monitor blacks out if VGA controller is reset. >>>>> >>>> Couple questions wrt VGA device: >>>> (1) Many graphics devices are multi-function, one function being VGA; >>>> is the VGA always function 0, so this scan sees it first& doesn't >>>> do a reset on that PCIe link? if the VGA is not function 0, won't >>>> this logic break (will reset b/c function 0 is non-VGA graphics) ? >>> >>> VGA is not reset irrespective of its function number. The logic of this >>> patch is: >>> >>> for_each_pci_dev(dev) { >>> if (dev is not PCIe) >>> continue; >>> if (dev is not root port/downstream port) ---(1) >>> continue; >>> list_for_each_entry(child,&dev->subordinate->devices, bus_list) { >>> if (child is upstream port or bridge or VGA) ---(2) >>> continue; >>> } >>> do_reset_its_child(dev); >>> } >>> >>> Therefore VGA itself is skipped by (1), and upstream device(root port or >>> downstream port) of VGA is also skipped by (2). >>> >>> >>>> (2) I'm hearing VGA will soon not be the a required console; this logic >>>> assumes it is, and why it isn't blanked. >>>> Q: Should the filter be based on a device having a device-class of display ? >>> >>> I want to avoid the situation that user's monitor blacks out and user >>> cannot know what's going on. That's reason why I introduced the logic to >>> skip VGA. As far as I tested the logic based on device-class works well, >> sorry, I read your description, which said VGA, but your are filtering on display class, >> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA display devices >> are probably one of the most aggressive DMA engines on a system.... and will grow as >> asymmetric processing using GPUs gets architected into a device-agnostic manner. >> So, this may work well for servers, which is the primary consumer/user of this feature, >> and they typically have built-in graphics that are generally used in simple VGA mode, >> so this may be sufficient for now. > > Ok, understood. > > >> >>> but I would appreciate it if there are better ways. >>> >> You probably don't want to hear it but.... >> a) only turn off cmd-reg master enable bit >> b) only do reset based on a list of devices known not to >> obey their cmd-reg master enable bit, and only do reset to those devices. >> But, given the testing you've done so far, this optional (need cmdline) feature, >> let's start here. > > Ok. Either way I think we need more testing. > > >>>> >>>>> Actually this is v8 patch but quite different from v7 and it's been so >>>>> long since previous post, so I start over again. >>>> Thanks for this re-start. I need to continue reviewing the rest. >>> >>> Thank you for your review! >>> >>>> >>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash >>>> dump? After the crash dump, the system is rebooting to previous (iommu=on) setting. >>>> That logic, along w/your previous patch to disable the IOMMU if iommu=off >>>> is set, would remove this (relatively slow) PCI init sequencing ? >>> >>> To force iommu off, all ongoing DMA have to be stopped before that since >>> they are accessing the device address, not physical address. If we disable >>> iommu without stopping in-flihgt DMA, devices access invalid memory area >>> and it causes memory corruption or PCI-SERR due to DMA error. >> Right, that's a 'duh' on my part. >> I thought 'disable iommu' == 'block all dma' and it just turns it off& >> let's the ongoing DMA run... >> Please ignore this question... sigh. >> >>> >>> So, whether we use iommu or not in second kernel, we have to stop DMA in >>> second kernel if iommu is used in first kernel. >>> >>> Thanks, >>> Takao Indoh >>> >>> >>>> >>>>> Previous post: >>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>>> https://lkml.org/lkml/2012/11/26/814 >>>>> >>>>> Signed-off-by: Takao Indoh<indou.takao at jp.fujitsu.com> >>>>> --- >>>>> Documentation/kernel-parameters.txt | 2 + >>>>> drivers/pci/pci.c | 103 +++++++++++++++++++++++++++++++++++ >>>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>>> >>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >>>>> index 4609e81..2a31ade 100644 >>>>> --- a/Documentation/kernel-parameters.txt >>>>> +++ b/Documentation/kernel-parameters.txt >>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. >>>>> any pair of devices, possibly at the cost of >>>>> reduced performance. This also guarantees >>>>> that hot-added devices will work. >>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>>> + reset >>>>> cbiosize=nn[KMG] The fixed amount of bus space which is >>>>> reserved for the CardBus bridge's IO window. >>>>> The default value is 256 bytes. >>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>>> index b099e00..42385c9 100644 >>>>> --- a/drivers/pci/pci.c >>>>> +++ b/drivers/pci/pci.c >>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >>>>> } >>>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>>> >>>>> +/* >>>>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe >>>>> + * endpoint except VGA device. >>>>> + */ >>>>> +static int __init need_reset(struct pci_dev *dev) >>>>> +{ >>>>> + struct pci_bus *subordinate; >>>>> + struct pci_dev *child; >>>>> + >>>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>>> + list_empty(&dev->subordinate->devices) || >>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>>> + return 0; >>>>> + >>>>> + subordinate = dev->subordinate; >>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>>> + /* Don't reset switch, bridge, VGA device */ >>>>> + return 0; >>>>> + } >>>>> + >>>>> + return 1; >>>>> +} >>>>> + >>>>> +static void __init save_config(struct pci_dev *dev) >> This should be renamed. it implies it is saving the config of the pdev passed in, >> when in reality, it is saving the config of all the devices attached to this pdev. >> i.e., save_downstream_configs() >> >>>>> +{ >>>>> + struct pci_bus *subordinate; >>>>> + struct pci_dev *child; >>>>> + >>>>> + if (!need_reset(dev)) >>>>> + return; >>>>> + >>>>> + subordinate = dev->subordinate; >>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>> + dev_info(&child->dev, "save state\n"); >>>>> + pci_save_state(child); >>>>> + } >>>>> +} >>>>> + >>>>> +static void __init restore_config(struct pci_dev *dev) >> inverse of above: restore_downstream_configs() >>>>> +{ >>>>> + struct pci_bus *subordinate; >>>>> + struct pci_dev *child; >>>>> + >>>>> + if (!need_reset(dev)) >>>>> + return; >>>>> + >>>>> + subordinate = dev->subordinate; >>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>> + dev_info(&child->dev, "restore state\n"); >>>>> + pci_restore_state(child); >>>>> + } >>>>> +} >>>>> + >>>>> +static void __init do_device_reset(struct pci_dev *dev) >> do_downstream_device_reset() -- it's not resetting this pdev, >> but the pdev's of the devices attached to it. >> > Right, I'll change them as you said. > > >>>>> +{ >>>>> + u16 ctrl; >>>>> + >>>>> + if (!need_reset(dev)) >>>>> + return; >>>>> + >>>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>>> + >>>>> + /* Assert Secondary Bus Reset */ >>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>> + >>>>> + msleep(2); >>>>> + >> This works well for x86, which uses ioport registers to access >> these<256-offset registers, b/c the write() function can't return >> until the write is actually completed, but for a non-x86 system, >> with fully mmconf'd PCI space, a write() may still be a write& run >> (sitting in CPU write-merge) buffer, so if you need a full 2ms, >> you ought to do another read_config() to the device, to flush the write, >> before starting the msleep(2) clock. >> > I didn't know that... I'll insert something like this before msleep. > > pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy); > > BTW, I referred aer_do_secondary_bus_reset() when I wrote this code. > Probably need the same fix, right? > sounds like it. Interested to hear from someone in non-x86-land how they ensure pci cfg writes aren't buffered in their cpus; maybe handled via special uncached regions on other arches. > >>>>> + /* De-assert Secondary Bus Reset */ >>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>> +} >>>>> + >>>>> +static int __initdata pcie_reset_devices; >>>>> +static int __init reset_pcie_devices(void) >> this should be reset_pcie_endpoints() ... it's not resetting all pcie devices >>>>> +{ >>>>> + struct pci_dev *dev = NULL; >>>>> + >>>>> + if (!pcie_reset_devices) >>>>> + return 0; >>>>> + >>>>> + for_each_pci_dev(dev) >>>>> + save_config(dev); >>>>> + >>>>> + for_each_pci_dev(dev) >>>>> + do_device_reset(dev); >>>>> + >>>>> + msleep(1000); >>>>> + >>>>> + for_each_pci_dev(dev) >>>>> + restore_config(dev); >>>>> + >> My apologies if past thread covered this sequence... >> Why three loops through all PCIe devices on the system? >> why not have the first for-each-pci-dev() loop filter devices >> to be reset, and save_config those pdev's, >> return a list of saved_pdev's; feed that list into the do_device_reset(); >> then mpsleep(), then restore the list. >> in fact, once you create a link list of pdev's to reset, >> just loop that list doing save, then reset; rtn the list, do msleep(), >> then restore the config of pdevs in the list. >> Otherwise, doing much more traversing than what's needed. >> Doing a great deal more config-saving then needed right now as well >> (saving all non-endpt devices that aren't reset). >> > > Creating list is nice idea, thanks. I'll do. > > I'll have vacation next week, so I'll post v2 patch the week after > next. > > Thanks, > Takao Indoh > Have a good vacation! Thanks for your willingness to respin again & make this improvement.