Re: [PATCH v2 1/3] PCI / ACPI: Use cached ACPI device state to get PCI device power state

Bjorn Helgaas <helgaas@xxxxxxxxxx> · Fri, 21 Jun 2019 08:09:20 -0500

On Fri, Jun 21, 2019 at 12:32:22PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 20, 2019 at 4:15 PM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Thu, Jun 20, 2019 at 04:37:10PM +0300, Mika Westerberg wrote:
> > > On Thu, Jun 20, 2019 at 08:16:49AM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Jun 20, 2019 at 11:27:30AM +0300, Mika Westerberg wrote:
> > > > > On Wed, Jun 19, 2019 at 04:28:01PM -0500, Bjorn Helgaas wrote:
> > > > > > On Tue, Jun 18, 2019 at 07:18:56PM +0300, Mika Westerberg wrote:
> > > > > > > Intel Ice Lake has an integrated Thunderbolt controller which
> > > > > > > means that the PCIe topology is extended directly from the two
> > > > > > > root ports (RP0 and RP1).
> > > > > >
> > > > > > A PCIe topology is always extended directly from root ports,
> > > > > > regardless of whether a Thunderbolt controller is integrated, so I
> > > > > > guess I'm missing the point you're making.  It doesn't sound like
> > > > > > this is anything specific to Thunderbolt?
> > > > >
> > > > > The point I'm trying to make here is to explain why this is problem
> > > > > now and not with the previous discrete controllers. With the
> > > > > previous there was only a single ACPI power resource for the root
> > > > > port and the Thunderbolt host router was connected to that root
> > > > > port. PCIe hierarchy was extended through downstream ports (not root
> > > > > ports) of that controller (which includes PCIe switch).
> > > >
> > > > Sounds like you're using "PCIe topology extension" to mean
> > > > specifically something below a Thunderbolt controller, excluding a
> > > > subtree below a root port.  I don't think the PCI core is aware of
> > > > that distinction.
> > >
> > > Right it is not.
> > >
> > > > > Now the thing is part of the SoC so power management is different
> > > > > and causes problems in Linux.
> > > >
> > > > The SoC is a physical packaging issue that really doesn't enter into
> > > > the specs directly.  I'm trying to get at the logical topology
> > > > questions in terms of the PCIe and ACPI specs.
> > > >
> > > > I assume we could dream up a non-Thunderbolt topology that would show
> > > > the same problem?
> > >
> > > Yes.
> > >
> > > > > > > Power management is handled by ACPI power resources that are
> > > > > > > shared between the root ports, Thunderbolt controller (NHI) and xHCI
> > > > > > > controller.
> > > > > > >
> > > > > > > The topology with the power resources (marked with []) looks like:
> > > > > > >
> > > > > > >   Host bridge
> > > > > > >     |
> > > > > > >     +- RP0 ---\
> > > > > > >     +- RP1 ---|--+--> [TBT]
> > > > > > >     +- NHI --/   |
> > > > > > >     |            |
> > > > > > >     |            v
> > > > > > >     +- xHCI --> [D3C]
> > > > > > >
> > > > > > > Here TBT and D3C are the shared ACPI power resources. ACPI
> > > > > > > _PR3() method returns either TBT or D3C or both.
> > > >
> > > > I'm not very familiar with _PR3.  I guess this is under an ACPI object
> > > > representing a PCI device, e.g., \_SB.PCI0.RP0._PR3?
> > >
> > > Correct.
> > >
> > > > > > > Say we runtime suspend first the root ports RP0 and RP1, then
> > > > > > > NHI. Now since the TBT power resource is still on when the root
> > > > > > > ports are runtime suspended their dev->current_state is set to
> > > > > > > D3hot. When NHI is runtime suspended TBT is finally turned off
> > > > > > > but state of the root ports remain to be D3hot.
> > > >
> > > > So in this example we might have:
> > > >
> > > >   _SB.PCI0.RP0._PR3: TBT
> > > >   _SB.PCI0.RP1._PR3: TBT
> > > >   _SB.PCI0.NHI._PR3: TBT
> > >
> > > and also D3C.
> > >
> > > > And when Linux figures out that everything depending on TBT is in
> > > > D3hot, it evaluates TBT._OFF, which puts them all in D3cold?  And part
> > > > of the problem is that they're now in D3cold (where config access
> > > > doesn't work) but Linux still thinks they're in D3hot (where config
> > > > access would work)?
> > >
> > > Exactly.
> > >
> > > > I feel like I'm missing something because I don't know how D3C is
> > > > involved, since you didn't mention suspending xHCI.
> > >
> > > That's another power resource so we will also have D3C turned off when
> > > xHCI gets suspended but I did not want to complicate things too much in
> > > the changelog.
> >
> > If D3C isn't essential to seeing this problem, you could just omit it
> > altogether.  I think stripping out anything that's not essential will
> > make it easier to think about the underlying issues.
> >
> > > > And I can't mentally match up the patch with the D3hot/D3cold state
> > > > change (if indeed that's the problem).  If we were updating the path
> > > > that evaluates _OFF so it changed the power state of all dependent
> > > > devices, *that* would make a lot of sense to me because it sounds like
> > > > that's where the physical change happens that makes things out of
> > > > sync.
> > >
> > > I did that in the first version [1] but Rafael pointed out that it is
> > > racy one way or another [2].
> > >
> > > [1] https://www.spinics.net/lists/linux-pci/msg83583.html
> > > [2] https://www.spinics.net/lists/linux-pci/msg83600.html
> >
> > Yeah, interesting.  It was definitely a much larger patch.  I don't
> > know enough to comment on the races.
> 
> Say two power resources are listed by _PR3 for one device (because why
> not?) and you want to change the device's state to D3cold only if the
> two power resources are both "off".  Then, you need some locking (or
> equivalent) to synchronize two power resources with each other, so
> that you can change the devices state when the last of them goes _OFF.
> Currently, there is no such synchronization between power resources
> other then the "system_level" value which may not be reliable enough
> for this type of use.
> 
> Or you can say that the device is in D3cold if at least one of the
> power resources is _OFF, but IMO that may not really be consistent
> with the view that the "logical" power state of the device should
> reflect the physical reality accurately.
> 
> > I would wonder whether there's a way to get rid of the caches that become stale,
> 
> I guess what you mean is that the "cached" (or rather "logical" or
> "expected") power state value may become different from what is
> returned by acpi_device_get_power() for the device.
> 
> The problem here is that acpi_device_get_power() really only should be
> used for two purposes: (1) To initialize adev->power.state, or to
> update it via acpi_device_update_power(), and (2) by the
> "real_power_state" sysfs attribute (of ACPI device objects).  The
> adev->power.state value should be used anywhere else, in principle, so
> the Mika's patch is correct.
> 
> [Note that adev->power.state cannot be updated after calling
> acpi_device_get_power() to the value returned by it without updating
> the reference counters of the power resources that are "on" *exactly*
> because of the problem at hand here.]
> 
> > but that's just an idle thought, not a suggestion.
> 
> After the initialization of the ACPI subsystem, the authoritative
> source of the ACPI device power state information is
> adev->power.state.  The ACPI subsystem is expected to update this
> value as needed going forward (including system-wide transitions like
> resume from S3).

Thanks, this is all very helpful!  Do you by any chance add
lore.kernel.org links to commit logs when applying patches?  This is a
case where I think the discussion could be useful in the future.

Link: https://lore.kernel.org/r/20190618161858.77834-2-mika.westerberg@xxxxxxxxxxxxxxx