Re: [PATCH v4] PCI/ASPM: Add back L1 PM Substate save and restore

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2023-10-09 at 11:34 -0500, Bjorn Helgaas wrote:
> On Mon, Oct 09, 2023 at 11:34:34AM +0300, Mika Westerberg wrote:
> > On Thu, Oct 05, 2023 at 12:22:26PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Oct 05, 2023 at 05:57:52PM +0200, Thomas Witt wrote:
> > > > On 05/10/2023 17:30, Bjorn Helgaas wrote:
> > > > ...
> > > 
> > > > > Right, without the denylist, I expect Thomas' TUXEDO to fail,
> > > > > but I still hope we can figure out why.  If we just keep it on
> > > > > the denylist, that system will suffer from more power
> > > > > consumption than necessary, but only after suspend/resume.
> > > > > 
> > > > > A denylist seems like the absolute last resort.  In this case
> > > > > we don't know about anything *wrong* with those platforms; all
> > > > > we know is that our resume path doesn't work.  It's likely
> > > > > that it fails on other platforms we haven't heard about, too.
> > > > 
> > > > The best guess from Mika and David was a firmware issue, but I
> > > > run the same Firmware revision as Werner. I even reflashed the
> > > > Firmware, but that did not change anything:
> > > 
> > > Possibly a BIOS settings difference?
> > > 
> > > > Quoting David Box:
> > > > > I agree that we should pursue an exception for your system.
> > > > > This is looking like a firmware bug. One thing we did notice
> > > > > in the turbostat results is your IRTL (Interrupt Response Time
> > > > > Limit) values are bogus:
> > > > > 
> > > > > cpu6: MSR_PKGC3_IRTL: 0x0000884e (valid, 79872 ns)
> > > > > cpu6: MSR_PKGC6_IRTL: 0x00008000 (valid, 0 ns)
> > > > > cpu6: MSR_PKGC7_IRTL: 0x00008000 (valid, 0 ns)
> > > > > cpu6: MSR_PKGC8_IRTL: 0x00008000 (valid, 0 ns)
> > > > > cpu6: MSR_PKGC9_IRTL: 0x00008000 (valid, 0 ns)
> > > > > cpu6: MSR_PKGC10_IRTL: 0x00008000 (valid, 0 ns)
> > > > > 
> > > > > This is despite the PKGC configuration register showing that all
> > > > > states are enabled:
> > > > > 
> > > > > cpu6: MSR_PKG_CST_CONFIG_CONTROL: 0x1e008008 (UNdemote-C3, UNdemote-
> > > > > C1,
> > > > demote-
> > > > C3, demote-C1, locked, pkg-cstate-limit=8 (unlimited))
> > > > > 
> > > > > Firmware sets this.
> > > 
> > > I can't find this discussion, but if there's a firmware issue related
> > > to IRTL MSRs, I would want the workaround in intel-idle.c or whatever
> > > code deals with the MSRs, not in the ASPM code.
> > 
> > Unfortunately that discussion never ended up on the mailing list. But in
> > summary that particular system seems to run pretty hot (if I understand
> > correctly what David concluded). This is the reason Thomas has the
> > pm_sleep=deep in the command line and this is why the L1 SS restore then
> > causes the failure on resume. Without this it works fine but consumes
> > lot of energy in s2idle.
> > 
> > Can you suggest what we should do with this now?
> > 
> > We got report from Tasev Nikola on
> > https://bugzilla.kernel.org/show_bug.cgi?id=216782 that even if the Asus
> > system is removed from the denylist it works so that we can do. However,
> > with the Thomas' system I'm not sure. If we leave it like this:
> > 
> > static int aspm_l1ss_suspend_via_firmware(const struct dmi_system_id
> > *not_used)
> > {
> >         return pm_suspend_via_firmware();
> > }
> > 
> > static const struct dmi_system_id aspm_l1ss_denylist[] = {
> >         {
> >                 /*
> >                  * https://bugzilla.kernel.org/show_bug.cgi?id=216877
> >                  *
> >                  * This system needs to use suspend to mem instead of its 
> >                  * default (suspend to idle) to avoid draining the battery. 
> >                  * However, the BIOS gets confused if we try to restore the
> >                  * L1SS registers so avoid doing that if the user forced
> >                  * suspend to mem. The default suspend to idle on the other
> >                  * hand needs restoring L1SS to allow the CPU to enter low
> >                  * power states. This entry should handle both.
> >                  */
> >                 .callback = aspm_l1ss_suspend_via_firmware,
> >                 .ident = "TUXEDO InfinityBook S 14 v5",
> >                 .matches = {
> >                         DMI_MATCH(DMI_BOARD_VENDOR, "TUXEDO"),
> >                         DMI_MATCH(DMI_BOARD_NAME, "L140CU"),
> >                 },
> >         },
> >         { }
> > };
> > 
> > Then it should work with Thomas' system (defaults to deep), and TUXEDO
> > with default settings (defaults to s2idle), and with the rest of the
> > world (I hope at least, fingers crossed). Or you want to pursue a
> > solution without the denylist still? I'm out of ideas what could be
> > wrong except that when the pm_sleep=deep it means the BIOS is involved
> > in the suspend/restore of the devices and it may not expect the OS to
> > touch these registers or something along those lines.
> 
> [I think this refers to "mem_sleep_default=deep" (not pm_sleep)]
> 
> I assume everything was fine before suspend, and it only runs hot
> after resume.

The system was running hot even before suspend. After boot it could not exit
Package C0. At that level it's usually some compute element that is blocking,
and not caused by a device's LTR. A cause for this was not found, only signs
that BIOS did not configure the system correctly (bogus IRTL values).

That issue does appear to be why Thomas needs to use deep since his system
cannot use s2idle since it cannot even idle while running. There's no relation I
can see between that issue and the PCI device failure after L1 substate
restore. 

But to your point, those PCI devices were working upon boot, with L1 configured,
and not working after resume when restored on that system.

>   And the platform granted control of the PCIe Capability
> and the LTR Capability to OSPM via _OSC?

Need to confirm this.

> 
> If so, I think we should try to find out what the difference is, e.g.,
> compare config space before/after the suspend/resume.  Maybe that's
> already been tried?  (I did check the archives but couldn't find
> details.)

I'll take a look again at the information we got from Thomas and check for
config space differences.

David




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux