(2013/04/26 3:01), Don Dutile wrote: > On 04/25/2013 01:11 AM, Takao Indoh wrote: >> (2013/04/25 4:59), Don Dutile wrote: >>> On 04/24/2013 12:58 AM, Takao Indoh wrote: >>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>> PCIe root port and downstream port to reset its downstream endpoint. >>>> >>>> Problem: >>>> This patch solves the problem that kdump can fail when intel_iommu=on is >>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>> occur in second kernel and it causes problems like driver error or PCI >>>> SERR, at last kdump fails. This problem is caused as follows. >>>> 1) Devices are working on first kernel. >>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>> and its DMA continues during this switch. >>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes >>>> dma-remapping errors. >>>> >>>> Solution: >>>> All DMA transactions have to be stopped before iommu is initialized. By >>>> this patch devices are reset and in-flight DMA is stopped before >>>> pci_iommu_init. >>>> >>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>> between them. If the endpoint is VGA device, it is skipped because the >>>> monitor blacks out if VGA controller is reset. >>>> >>> Couple questions wrt VGA device: >>> (1) Many graphics devices are multi-function, one function being VGA; >>> is the VGA always function 0, so this scan sees it first& doesn't >>> do a reset on that PCIe link? if the VGA is not function 0, won't >>> this logic break (will reset b/c function 0 is non-VGA graphics) ? >> >> VGA is not reset irrespective of its function number. The logic of this >> patch is: >> >> for_each_pci_dev(dev) { >> if (dev is not PCIe) >> continue; >> if (dev is not root port/downstream port) ---(1) >> continue; >> list_for_each_entry(child,&dev->subordinate->devices, bus_list) { >> if (child is upstream port or bridge or VGA) ---(2) >> continue; >> } >> do_reset_its_child(dev); >> } >> >> Therefore VGA itself is skipped by (1), and upstream device(root port or >> downstream port) of VGA is also skipped by (2). >> >> >>> (2) I'm hearing VGA will soon not be the a required console; this logic >>> assumes it is, and why it isn't blanked. >>> Q: Should the filter be based on a device having a device-class of display ? >> >> I want to avoid the situation that user's monitor blacks out and user >> cannot know what's going on. That's reason why I introduced the logic to >> skip VGA. As far as I tested the logic based on device-class works well, > sorry, I read your description, which said VGA, but your are filtering on display class, > which includes non-VGA as well. So, all set ... but large, (x16) non-VGA display devices > are probably one of the most aggressive DMA engines on a system.... and will grow as > asymmetric processing using GPUs gets architected into a device-agnostic manner. > So, this may work well for servers, which is the primary consumer/user of this feature, > and they typically have built-in graphics that are generally used in simple VGA mode, > so this may be sufficient for now. Ok, understood. > >> but I would appreciate it if there are better ways. >> > You probably don't want to hear it but.... > a) only turn off cmd-reg master enable bit > b) only do reset based on a list of devices known not to > obey their cmd-reg master enable bit, and only do reset to those devices. > But, given the testing you've done so far, this optional (need cmdline) feature, > let's start here. Ok. Either way I think we need more testing. >>> >>>> Actually this is v8 patch but quite different from v7 and it's been so >>>> long since previous post, so I start over again. >>> Thanks for this re-start. I need to continue reviewing the rest. >> >> Thank you for your review! >> >>> >>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash >>> dump? After the crash dump, the system is rebooting to previous (iommu=on) setting. >>> That logic, along w/your previous patch to disable the IOMMU if iommu=off >>> is set, would remove this (relatively slow) PCI init sequencing ? >> >> To force iommu off, all ongoing DMA have to be stopped before that since >> they are accessing the device address, not physical address. If we disable >> iommu without stopping in-flihgt DMA, devices access invalid memory area >> and it causes memory corruption or PCI-SERR due to DMA error. > Right, that's a 'duh' on my part. > I thought 'disable iommu' == 'block all dma' and it just turns it off & > let's the ongoing DMA run... > Please ignore this question... sigh. > >> >> So, whether we use iommu or not in second kernel, we have to stop DMA in >> second kernel if iommu is used in first kernel. >> >> Thanks, >> Takao Indoh >> >> >>> >>>> Previous post: >>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>> https://lkml.org/lkml/2012/11/26/814 >>>> >>>> Signed-off-by: Takao Indoh<indou.takao@xxxxxxxxxxxxxx> >>>> --- >>>> Documentation/kernel-parameters.txt | 2 + >>>> drivers/pci/pci.c | 103 +++++++++++++++++++++++++++++++++++ >>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>> >>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt >>>> index 4609e81..2a31ade 100644 >>>> --- a/Documentation/kernel-parameters.txt >>>> +++ b/Documentation/kernel-parameters.txt >>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted. >>>> any pair of devices, possibly at the cost of >>>> reduced performance. This also guarantees >>>> that hot-added devices will work. >>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>> + reset >>>> cbiosize=nn[KMG] The fixed amount of bus space which is >>>> reserved for the CardBus bridge's IO window. >>>> The default value is 256 bytes. >>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>> index b099e00..42385c9 100644 >>>> --- a/drivers/pci/pci.c >>>> +++ b/drivers/pci/pci.c >>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus) >>>> } >>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>> >>>> +/* >>>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe >>>> + * endpoint except VGA device. >>>> + */ >>>> +static int __init need_reset(struct pci_dev *dev) >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>> + list_empty(&dev->subordinate->devices) || >>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>> + return 0; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>> + /* Don't reset switch, bridge, VGA device */ >>>> + return 0; >>>> + } >>>> + >>>> + return 1; >>>> +} >>>> + >>>> +static void __init save_config(struct pci_dev *dev) > This should be renamed. it implies it is saving the config of the pdev passed in, > when in reality, it is saving the config of all the devices attached to this pdev. > i.e., save_downstream_configs() > >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + dev_info(&child->dev, "save state\n"); >>>> + pci_save_state(child); >>>> + } >>>> +} >>>> + >>>> +static void __init restore_config(struct pci_dev *dev) > inverse of above: restore_downstream_configs() >>>> +{ >>>> + struct pci_bus *subordinate; >>>> + struct pci_dev *child; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + subordinate = dev->subordinate; >>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>> + dev_info(&child->dev, "restore state\n"); >>>> + pci_restore_state(child); >>>> + } >>>> +} >>>> + >>>> +static void __init do_device_reset(struct pci_dev *dev) > do_downstream_device_reset() -- it's not resetting this pdev, > but the pdev's of the devices attached to it. > Right, I'll change them as you said. >>>> +{ >>>> + u16 ctrl; >>>> + >>>> + if (!need_reset(dev)) >>>> + return; >>>> + >>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>> + >>>> + /* Assert Secondary Bus Reset */ >>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>> + >>>> + msleep(2); >>>> + > This works well for x86, which uses ioport registers to access > these <256-offset registers, b/c the write() function can't return > until the write is actually completed, but for a non-x86 system, > with fully mmconf'd PCI space, a write() may still be a write & run > (sitting in CPU write-merge) buffer, so if you need a full 2ms, > you ought to do another read_config() to the device, to flush the write, > before starting the msleep(2) clock. > I didn't know that... I'll insert something like this before msleep. pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &dummy); BTW, I referred aer_do_secondary_bus_reset() when I wrote this code. Probably need the same fix, right? >>>> + /* De-assert Secondary Bus Reset */ >>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>> +} >>>> + >>>> +static int __initdata pcie_reset_devices; >>>> +static int __init reset_pcie_devices(void) > this should be reset_pcie_endpoints() ... it's not resetting all pcie devices >>>> +{ >>>> + struct pci_dev *dev = NULL; >>>> + >>>> + if (!pcie_reset_devices) >>>> + return 0; >>>> + >>>> + for_each_pci_dev(dev) >>>> + save_config(dev); >>>> + >>>> + for_each_pci_dev(dev) >>>> + do_device_reset(dev); >>>> + >>>> + msleep(1000); >>>> + >>>> + for_each_pci_dev(dev) >>>> + restore_config(dev); >>>> + > My apologies if past thread covered this sequence... > Why three loops through all PCIe devices on the system? > why not have the first for-each-pci-dev() loop filter devices > to be reset, and save_config those pdev's, > return a list of saved_pdev's; feed that list into the do_device_reset(); > then mpsleep(), then restore the list. > in fact, once you create a link list of pdev's to reset, > just loop that list doing save, then reset; rtn the list, do msleep(), > then restore the config of pdevs in the list. > Otherwise, doing much more traversing than what's needed. > Doing a great deal more config-saving then needed right now as well > (saving all non-endpt devices that aren't reset). > Creating list is nice idea, thanks. I'll do. I'll have vacation next week, so I'll post v2 patch the week after next. Thanks, Takao Indoh -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html