Re: [PATCH v4] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges

Karol Herbst <kherbst@xxxxxxxxxx> · Thu, 21 Nov 2019 23:50:26 +0100

On Thu, Nov 21, 2019 at 11:39 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Thu, Nov 21, 2019 at 8:49 PM Mika Westerberg
> <mika.westerberg@xxxxxxxxx> wrote:
> >
> > On Thu, Nov 21, 2019 at 04:43:24PM +0100, Rafael J. Wysocki wrote:
> > > On Thu, Nov 21, 2019 at 1:52 PM Mika Westerberg
> > > <mika.westerberg@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, Nov 21, 2019 at 01:46:14PM +0200, Mika Westerberg wrote:
> > > > > On Thu, Nov 21, 2019 at 12:34:22PM +0100, Rafael J. Wysocki wrote:
> > > > > > On Thu, Nov 21, 2019 at 12:28 PM Mika Westerberg
> > > > > > <mika.westerberg@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Wed, Nov 20, 2019 at 11:29:33PM +0100, Rafael J. Wysocki wrote:
> > > > > > > > > last week or so I found systems where the GPU was under the "PCI
> > > > > > > > > Express Root Port" (name from lspci) and on those systems all of that
> > > > > > > > > seems to work. So I am wondering if it's indeed just the 0x1901 one,
> > > > > > > > > which also explains Mikas case that Thunderbolt stuff works as devices
> > > > > > > > > never get populated under this particular bridge controller, but under
> > > > > > > > > those "Root Port"s
> > > > > > > >
> > > > > > > > It always is a PCIe port, but its location within the SoC may matter.
> > > > > > >
> > > > > > > Exactly. Intel hardware has PCIe ports on CPU side (these are called
> > > > > > > PEG, PCI Express Graphics, ports), and the PCH side. I think the IP is
> > > > > > > still the same.
> > > > > > >
> > > > > > > > Also some custom AML-based power management is involved and that may
> > > > > > > > be making specific assumptions on the configuration of the SoC and the
> > > > > > > > GPU at the time of its invocation which unfortunately are not known to
> > > > > > > > us.
> > > > > > > >
> > > > > > > > However, it looks like the AML invoked to power down the GPU from
> > > > > > > > acpi_pci_set_power_state() gets confused if it is not in PCI D0 at
> > > > > > > > that point, so it looks like that AML tries to access device memory on
> > > > > > > > the GPU (beyond the PCI config space) or similar which is not
> > > > > > > > accessible in PCI power states below D0.
> > > > > > >
> > > > > > > Or the PCI config space of the GPU when the parent root port is in D3hot
> > > > > > > (as it is the case here). Also then the GPU config space is not
> > > > > > > accessible.
> > > > > >
> > > > > > Why would the parent port be in D3hot at that point?  Wouldn't that be
> > > > > > a suspend ordering violation?
> > > > >
> > > > > No. We put the GPU into D3hot first,
> > >
> > > OK
> > >
> > > Does this involve any AML, like a _PS3 under the GPU object?
> >
> > I don't see _PS3 (nor _PS0) for that object. If I read it right the GPU
> > itself is not described in ACPI tables at all.
>
> OK
>
> > > > > then the root port and then turn
> > > > > off the power resource (which is attached to the root port) resulting
> > > > > the topology entering D3cold.
> > > >
> > > > I don't see that happening in the AML though.
> > >
> > > Which AML do you mean, specifically?  The _OFF method for the root
> > > port's _PR3 power resource or something else?
> >
> > The root port's _OFF method for the power resource returned by its _PR3.
>
> OK, so without the $subject patch we (1) program the downstream
> component (GPU) into D3hot, then we (2) program the port holding it
> into D3hot and then we (3) let the AML (_OFF for the power resource
> listed by _PR3 under the port object) run.
>
> Something strange happens at this point (and I guess that _OFF doesn't
> even reach the point where it removes power from the port which is why
> we see a lock-up).
>

it does though. I will post the data shortly (with the change in power
consumption), as I also want to do the ACPI traces now.

> We know that skipping (1) makes things work and we kind of suspect
> that skipping (3) would make things work either, but what about doing
> (1) and (3) without (2)?
>
> > > > Basically the difference is that when Windows 7 or Linux (the _REV==5
> > > > check) then we directly do link disable whereas in Windows 8+ we invoke
> > > > LKDS() method that puts the link into L2/L3. None of the fields they
> > > > access seem to touch the GPU itself.
> > >
> > > So that may be where the problem is.
> > >
> > > Putting the downstream component into PCI D[1-3] is expected to put
> > > the link into L1, so I'm not sure how that plays with the later
> > > attempt to put it into L2/L3 Ready.
> >
> > That should be fine. What I've seen the link goes into L1 when
> > downstream component is put to D-state (not D0) and then it is put back
> > to L0 when L2/3 ready is propagated. Eventually it goes into L2 or L3.
>
> Well, that's the expected behavior, but the observed behavior isn't as
> expected. :-)
>
> > > Also, L2/L3 Ready is expected to be transient, so finally power should
> > > be removed somehow.
> >
> > There is GPIO for both power and PERST, I think the line here:
> >
> >   \_SB.SGOV (0x01010004, Zero)
> >
> > is the one that removes power.
>
> OK
>
> > > > LKDS() for the first PEG port looks like this:
> > > >
> > > >    P0L2 = One
> > > >    Sleep (0x10)
> > > >    Local0 = Zero
> > > >    While (P0L2)
> > > >    {
> > > >         If ((Local0 > 0x04))
> > > >         {
> > > >             Break
> > > >         }
> > > >
> > > >         Sleep (0x10)
> > > >         Local0++
> > > >    }
> > > >
> > > > One thing that comes to mind is that the loop can end even if P0L2 is
> > > > not cleared as it does only 5 iterations with 16 ms sleep between. Maybe
> > > > Sleep() is implemented differently in Windows? I mean Linux may be
> > > > "faster" here and return prematurely and if we leave the port into D0
> > > > this does not happen, or something. I'm just throwing out ideas :)
> > >
> > > But this actually works for the downstream component in D0, doesn't it?
> >
> > It does and that leaves the link in L0 so it could be that then the
> > above AML works better or something.
>
> That would be my guess.
>
> > That reminds me, ASPM may have something to do with this as well.
>
> Not really if D-states are involved.
>
> > > Also, if the downstream component is in D0, the port actually should
> > > stay in D0 too, so what would happen with the $subject patch applied?
> >
> > Parent port cannot be lower D-state than the child so I agree it should
> > stay in D0 as well. However, it seems that what happens is that the
> > issue goes away :)
>
> Well, at least this is kind of out of the spec.
>
> Note that pci_pm_suspend_noirq() won't let the port go into D3 if the
> downstream device is in D0, so the $subject patch will not work as
> expected in the suspend-to-idle case.
>
> Also we really should make up our minds on whether or not to force
> PCIe ports to stay in D0 when downstream devices are in D0 and be
> consequent about that.  Right now we do one thing during system-wide
> suspend and the other one in PM-runtime, which is confusing.
>
> The current design is mostly based on the PCI PM Spec 1.2, so it would
> be consequent to follow system-wide suspend in PM-runtime and avoid
> putting PCIe ports holding devices in D0 into any low-power states.
> but that would make the approach in the $subject patch ineffective.
>
> Moreover, the fact that there are separate branches for "Windows 7"
> and "Windows 8+" kind of suggest a change in the expected behavior
> between Windows 7 and Windows 8, from the AML perspective.  I would
> guess that Windows 7 followed PCI PM 1.2 and Windows 8 (and later)
> does something else.  Now, the structure of the "Windows 8+" branch
> described by you suggests that, at least in the cases when it is going
> to remove power from the port eventually, it goes straight for the
> link preparation (the L2/L3 Ready transition) and power removal
> without bothering to program the downstream device and port into D3hot
> (because that's kind of redundant).
>
> That hypothetical "Windows 8+" approach may really work universally,
> because it doesn't seem to break any rules (going straight from D0 to
> D3cold is not disallowed and doing that for both a port and a
> downstream device at the same time is kind of OK either, as long as
> the link is ready for that).
>