On Mon, 22 Jan 2024 15:50:03 -0700 Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > On Mon, 22 Jan 2024 23:17:30 +0100 > Lukas Wunner <lukas@xxxxxxxxx> wrote: > > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > > > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@xxxxxxxxxx wrote: > > > > Testing that a device is not currently in a low power state provides no > > > > guarantees that the device is not immenently transitioning to such a state. > > > > We need to increment the PM usage counter before accessing the device. > > > > Since we don't wish to wake the device for PME polling, do so only if the > > > > device is already active by using pm_runtime_get_if_active(). > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > > > --- > > > > drivers/pci/pci.c | 23 ++++++++++++++++------- > > > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > > > Resurrecting this patch (currently commit d3fcd7360338) for discussion > > > as it's been identified as the source of a regression in: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=218360 > > > > > > Copying Mika, Lukas, and Rafael as it's related to: > > > > > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") > > > > > > where we skip devices in D3cold when processing the PME list. > > > > > > I think the issue in the above bz is that the downstream TB3/USB4 port > > > is in D3 (presumably D3hot) and I therefore infer the device is in state > > > RPM_SUSPENDED. This commit is attempting to make sure the device power > > > state is stable across the call such that it does not transition into > > > D3cold while we're accessing it. > > > > > > To do that I used pm_runtime_get_if_active(), but in retrospect this > > > requires the device to be in RPM_ACTIVE so we end up skipping anything > > > suspended or transitioning. > > > > How about dropping the calls to pm_runtime_get_if_active() and > > pm_runtime_put() and instead simply do: > > > > if (pm_runtime_suspended(&pdev->dev) && > > pdev->current_state != PCI_D3cold) > > pci_pme_wakeup(pdev, NULL); > > Hi Lukas, > > Do we require that the polled device is in the RPM_SUSPENDED state? > Also pm_runtime_suspended() can also only be trusted while holding the > device power.lock, we need a usage count reference to maintain that > state. > > I'm also seeing cases where the bridge is power state D0, but PM state > RPM_SUSPENDING, so config space of the polled device becomes > inaccessible even while we're holding a reference once we allow polling > in RPM_SUSPENDED. > > I'm currently working with the below patch, which I believe addresses > all these issues, but I'd welcome review and testing. Thanks, > > Alex > > commit 0a063b8e91d0bc807db712c81c8b270864f99ecb > Author: Alex Williamson <alex.williamson@xxxxxxxxxx> > Date: Tue Jan 16 13:28:33 2024 -0700 > > PCI: Fix active state requirement in PME polling > > The commit noted in fixes added a bogus requirement that runtime PM > managed devices need to be in the RPM_ACTIVE state for PME polling. > In fact, there is no requirement of a specific runtime PM state, it > is only required that the state is stable such that testing config > space availability, ie. !D3cold, remains valid across the PME wakeup. > > To that effect, defer polling of runtime PM managed devices that are > not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in > transitional states remain on the pci_pme_list and will be re-queued. > > However in allowing polling of devices in the RPM_SUSPENDED state, > the bridge state requires further refinement as it's possible to poll > while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING. > An asynchronous completion of the bridge transition to a low power > state can make config space of the subordinate device become > unavailable. A runtime PM reference to the bridge is therefore added > with a supplementary requirement that the bridge is in the RPM_ACTIVE > state. > > Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling") > Reported-by: Sanath S <sanath.s@xxxxxxx> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360 > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index bdbf8a94b4d0..31dbf1834b07 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work) > if (pdev->pme_poll) { > struct pci_dev *bridge = pdev->bus->self; > struct device *dev = &pdev->dev; > - int pm_status; > + struct device *bdev = bridge ? &bridge->dev : NULL; > > /* > - * If bridge is in low power state, the > - * configuration space of subordinate devices > - * may be not accessible > + * If we have a bridge, it should be in an active/D0 > + * state or the configuration space of subordinate > + * devices may not be accessible. > */ > - if (bridge && bridge->current_state != PCI_D0) > - continue; > + if (bdev) { > + spin_lock_irq(&bdev->power.lock); With the code as shown here I have one system that seems to be getting contention when reading the vpd sysfs attribute when the endpoints (QL41000) are bound to vfio-pci and unused, resulting in the root port and endpoints being suspended. A vpd read can take over a minute. Seems to be resolved changing the above spin_lock to a spin_trylock: if (!spin_trylock_irq(&bdev->power.lock)) continue; The pm_runtime_barrier() as used in the vpd path can be prone to such issues, I saw similar in the fix I previously proposed in the bz. I'll continue to do more testing with this change and hopefully Sanath can verify this resolves the bug report. Thanks, Alex > + if (!pm_runtime_active(bdev) || > + bridge->current_state != PCI_D0) { > + spin_unlock_irq(&bdev->power.lock); > + continue; > + } > + pm_runtime_get_noresume(bdev); > + spin_unlock_irq(&bdev->power.lock); > + } > > /* > - * If the device is in a low power state it > - * should not be polled either. > + * The device itself may be either in active or > + * suspended state, but must not be in D3cold so > + * that configuration space is accessible. The > + * transitional resuming and suspending states are > + * skipped to avoid D3cold races. > */ > - pm_status = pm_runtime_get_if_active(dev, true); > - if (!pm_status) > - continue; > - > - if (pdev->current_state != PCI_D3cold) > + spin_lock_irq(&dev->power.lock); > + if ((pm_runtime_active(dev) || > + pm_runtime_suspended(dev)) && > + pdev->current_state != PCI_D3cold) { > + pm_runtime_get_noresume(dev); > + spin_unlock_irq(&dev->power.lock); > pci_pme_wakeup(pdev, NULL); > - > - if (pm_status > 0) > pm_runtime_put(dev); > + } else { > + spin_unlock_irq(&dev->power.lock); > + } > + > + if (bdev) > + pm_runtime_put(bdev); > } else { > list_del(&pme_dev->list); > kfree(pme_dev);