Re: [PATCH 1/3] PCI: Add missing link delays required by the PCIe spec

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jun 5, 2019 at 4:58 PM Mika Westerberg
<mika.westerberg@xxxxxxxxxxxxxxx> wrote:
>
> Currently Linux does not follow PCIe spec regarding the required delays
> after reset. A concrete example is a Thunderbolt add-in-card that
> consists of a PCIe switch and two PCIe endpoints:
>
>   +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
>                                   +-01.0-[04-36]-- DS hotplug port
>                                   +-02.0-[37]----00.0 xHCI controller
>                                   \-04.0-[38-6b]-- DS hotplug port
>
> The root port (1b.0) and the PCIe switch downstream ports are all PCIe
> gen3 so they support 8GT/s link speeds.
>
> We wait for the PCIe hierarchy to enter D3cold (runtime):
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
>
> When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
> PCIe switch is put to reset and its power is re-applied. This means that
> we must follow the rules in PCIe 4.0 section 6.6.1.
>
> For the PCIe gen3 ports we are dealing with here, the following applies:
>
>   With a Downstream Port that supports Link speeds greater than 5.0
>   GT/s, software must wait a minimum of 100 ms after Link training
>   completes before sending a Configuration Request to the device
>   immediately below that Port. Software can determine when Link training
>   completes by polling the Data Link Layer Link Active bit or by setting
>   up an associated interrupt (see Section 6.7.3.3).
>
> Translating this into the above topology we would need to do this (DLLLA
> stands for Data Link Layer Link Active):
>
>   pcieport 0000:00:1b.0: wait for 100ms after DLLLA is set before access to 0000:01:00.0
>   pcieport 0000:02:00.0: wait for 100ms after DLLLA is set before access to 0000:03:00.0
>   pcieport 0000:02:02.0: wait for 100ms after DLLLA is set before access to 0000:37:00.0
>
> I've instrumented the kernel with additional logging so we can see the
> actual delays the kernel performs:
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D0
>   pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms
>   pcieport 0000:00:1b.0: waking up bus
>   pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
>   ...
>   pcieport 0000:00:1b.0: PME# disabled
>   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:01:00.0: PME# disabled
>   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:00.0: PME# disabled
>   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:01.0: PME# disabled
>   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:02.0: PME# disabled
>   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:04.0: PME# disabled
>   pcieport 0000:02:01.0: PME# enabled
>   pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms
>   pcieport 0000:02:04.0: PME# enabled
>   pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms
>   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
>   ...
>   thunderbolt 0000:03:00.0: PME# disabled
>   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
>   ...
>   xhci_hcd 0000:37:00.0: PME# disabled
>
> For the switch upstream port (01:00.0) we wait for 100ms but not taking
> into account the DLLLA requirement. We then wait 10ms for D3hot -> D0
> transition of the root port and the two downstream hotplug ports. This
> means that we deviate from what the spec requires.
>
> Performing the same check for system sleep (s2idle) transitions we can
> see following when resuming from s2idle:
>
>   pcieport 0000:00:1b.0: power state changed by ACPI to D0
>   pcieport 0000:00:1b.0: restoring config space at offset 0x2c (was 0x60, writing 0x60)
>   ...
>   pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   ...
>   pcieport 0000:02:02.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
>   pcieport 0000:02:01.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:04.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
>   pcieport 0000:02:00.0: restoring config space at offset 0x3c (was 0x1ff, writing 0x201ff)
>   pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
>   pcieport 0000:02:01.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
>   pcieport 0000:02:02.0: restoring config space at offset 0x20 (was 0x0, writing 0x73f073f0)
>   pcieport 0000:02:04.0: restoring config space at offset 0x2c (was 0x0, writing 0x60)
>   pcieport 0000:02:01.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
>   pcieport 0000:02:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
>   pcieport 0000:02:02.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x28 (was 0x0, writing 0x60)
>   pcieport 0000:02:01.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1ff10001)
>   pcieport 0000:02:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
>   pcieport 0000:02:02.0: restoring config space at offset 0x18 (was 0x0, writing 0x373702)
>   pcieport 0000:02:04.0: restoring config space at offset 0x24 (was 0x10001, writing 0x49f12001)
>   pcieport 0000:02:01.0: restoring config space at offset 0x20 (was 0x0, writing 0x73e05c00)
>   pcieport 0000:02:00.0: restoring config space at offset 0x24 (was 0x10001, writing 0x1fff1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x20 (was 0x0, writing 0x89f07400)
>   pcieport 0000:02:01.0: restoring config space at offset 0x1c (was 0x101, writing 0x5151)
>   pcieport 0000:02:00.0: restoring config space at offset 0x20 (was 0x0, writing 0x8a008a00)
>   pcieport 0000:02:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:04.0: restoring config space at offset 0x1c (was 0x101, writing 0x6161)
>   pcieport 0000:02:01.0: restoring config space at offset 0x18 (was 0x0, writing 0x360402)
>   pcieport 0000:02:00.0: restoring config space at offset 0x1c (was 0x101, writing 0x1f1)
>   pcieport 0000:02:04.0: restoring config space at offset 0x18 (was 0x0, writing 0x6b3802)
>   pcieport 0000:02:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:00.0: restoring config space at offset 0x18 (was 0x0, writing 0x30302)
>   pcieport 0000:02:01.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:04.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:00.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020)
>   pcieport 0000:02:01.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:04.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   pcieport 0000:02:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
>   xhci_hcd 0000:37:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x73f00000)
>   ...
>   thunderbolt 0000:03:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x8a040000)
>
> This is even worse. None of the mandatory delays are performed. If this
> would be S3 instead of s2idle then according to PCI FW spec 3.2 section
> 4.6.8.  there is a specific _DSM that allows the OS to skip the delays
> but this platform does not provide the _DSM and does not go to S3 anyway
> so no firmware is involved that could already handle these delays.
>
> In this particular Intel Coffee Lake platform these delays are not
> actually needed because there is an additional delay as part of the ACPI
> power resource that is used to turn on power to the hierarchy but since
> that additional delay is not required by any of standards (PCIe, ACPI)
> it is not present in the Intel Ice Lake, for example where missing the
> mandatory delays causes pciehp to start tearing down the stack too early
> (links are not yet trained).
>
> For this reason, change the PCIe portdrv PM resume hooks so that they
> perform the mandatory delays before the downstream component gets
> resumed. We perform the delays before port services are resumed because
> otherwise pciehp might find that the link is not up (even if it is just
> training) and tears-down the hierarchy.
>
> Signed-off-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>

Generally

Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

with a couple of nits below.

> ---
>  drivers/pci/pci.c               | 29 +++++++++------
>  drivers/pci/pci.h               |  1 +
>  drivers/pci/pcie/portdrv_core.c | 62 +++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8abc843b1615..87a1f902fa8e 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1004,15 +1004,10 @@ static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
>         if (state == PCI_D0) {
>                 pci_platform_power_transition(dev, PCI_D0);
>                 /*
> -                * Mandatory power management transition delays, see
> -                * PCI Express Base Specification Revision 2.0 Section
> -                * 6.6.1: Conventional Reset.  Do not delay for
> -                * devices powered on/off by corresponding bridge,
> -                * because have already delayed for the bridge.
> +                * Mandatory power management transition delays are
> +                * handled in the PCIe portdrv resume hooks.
>                  */
>                 if (dev->runtime_d3cold) {
> -                       if (dev->d3cold_delay && !dev->imm_ready)
> -                               msleep(dev->d3cold_delay);
>                         /*
>                          * When powering on a bridge from D3cold, the
>                          * whole hierarchy may be powered on into
> @@ -4568,14 +4563,16 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>
>         return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
>  }
> +
>  /**
> - * pcie_wait_for_link - Wait until link is active or inactive
> + * pcie_wait_for_link_delay - Wait until link is active or inactive
>   * @pdev: Bridge device
>   * @active: waiting for active or inactive?
> + * @delay: Delay to wait after link has become active (in ms)
>   *
>   * Use this to wait till link becomes active or inactive.
>   */
> -bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay)
>  {
>         int timeout = 1000;
>         bool ret;
> @@ -4612,13 +4609,25 @@ bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
>                 timeout -= 10;
>         }
>         if (active && ret)
> -               msleep(100);
> +               msleep(delay);
>         else if (ret != active)
>                 pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
>                         active ? "set" : "cleared");
>         return ret == active;
>  }
>
> +/**
> + * pcie_wait_for_link - Wait until link is active or inactive
> + * @pdev: Bridge device
> + * @active: waiting for active or inactive?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
> +{
> +       return pcie_wait_for_link_delay(pdev, active, 100);
> +}
> +
>  void pci_reset_secondary_bus(struct pci_dev *dev)
>  {
>         u16 ctrl;
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9cb99380c61e..59802b3def4b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -493,6 +493,7 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
>  void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>                       u32 service);
>
> +bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active, int delay);
>  bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>  #ifdef CONFIG_PCIEASPM
>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 1b330129089f..88d151a54be6 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -9,6 +9,7 @@
>  #include <linux/module.h>
>  #include <linux/pci.h>
>  #include <linux/kernel.h>
> +#include <linux/delay.h>
>  #include <linux/errno.h>
>  #include <linux/pm.h>
>  #include <linux/pm_runtime.h>
> @@ -378,6 +379,58 @@ static int pm_iter(struct device *dev, void *data)
>         return 0;
>  }
>
> +static int get_downstream_delay(struct pci_bus *bus)
> +{
> +       struct pci_dev *pdev;
> +       int min_delay = 100;
> +       int max_delay = 0;
> +
> +       list_for_each_entry(pdev, &bus->devices, bus_list) {
> +               if (!pdev->imm_ready)
> +                       min_delay = 0;
> +               else if (pdev->d3cold_delay < min_delay)
> +                       min_delay = pdev->d3cold_delay;
> +               if (pdev->d3cold_delay > max_delay)
> +                       max_delay = pdev->d3cold_delay;
> +       }
> +
> +       return max(min_delay, max_delay);
> +}
> +
> +static void wait_for_downstream_link(struct pci_dev *pdev)
> +{
> +       /*
> +        * Handle delays according to PCIe 4.0 section 6.6.1 before
> +        * configuration access to the downstream component is permitted.
> +        *
> +        * This blocks PCI core resume of the hierarchy below this port
> +        * until the link is trained.
> +        */

Shouldn't the above go to a kerneldoc comment?

> +       if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> +            pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM) &&
> +            pdev->subordinate && !list_empty(&pdev->subordinate->devices) &&
> +            pdev->bridge_d3 && !pci_dev_is_disconnected(pdev)) {

There's nothing except for this if () in the function, so maybe just
return here if the condition is not satisfied?

Also maybe split it into a port type check (you can return right away
if it fails), disconnected check and the rest?

> +               int delay;
> +
> +               delay = get_downstream_delay(pdev->subordinate);
> +               if (!delay)
> +                       return;
> +
> +               dev_dbg(&pdev->dev, "waiting downstream link for %d ms\n", delay);
> +
> +               /*
> +                * If downstream port does not support speeds greater than
> +                * 5 GT/s need to wait 100ms. For higher speeds (gen3) we
> +                * need to wait first for the data link layer to become
> +                * active.
> +                */
> +               if (pcie_get_speed_cap(pdev) <= PCIE_SPEED_5_0GT)
> +                       msleep(delay);
> +               else
> +                       pcie_wait_for_link_delay(pdev, true, delay);
> +       }
> +}
> +
>  /**
>   * pcie_port_device_suspend - suspend port services associated with a PCIe port
>   * @dev: PCI Express port to handle
> @@ -391,6 +444,13 @@ int pcie_port_device_suspend(struct device *dev)
>  int pcie_port_device_resume_noirq(struct device *dev)
>  {
>         size_t off = offsetof(struct pcie_port_service_driver, resume_noirq);
> +
> +       /*
> +        * Wait for the link to be fully up before resuming port services.
> +        * This prevents pciehp from starting to tear-down the hierarchy
> +        * too soon.
> +        */
> +       wait_for_downstream_link(to_pci_dev(dev));
>         return device_for_each_child(dev, &off, pm_iter);
>  }
>
> @@ -421,6 +481,8 @@ int pcie_port_device_runtime_suspend(struct device *dev)
>  int pcie_port_device_runtime_resume(struct device *dev)
>  {
>         size_t off = offsetof(struct pcie_port_service_driver, runtime_resume);
> +

The comment from pcie_port_device_resume_noirq() is applicable here
too, so maybe move it to a common place (kerneldoc?).

> +       wait_for_downstream_link(to_pci_dev(dev));
>         return device_for_each_child(dev, &off, pm_iter);
>  }
>  #endif /* PM */
> --
> 2.20.1
>



[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux