On 05/10/2023 17:30, Bjorn Helgaas wrote:
On Thu, Oct 05, 2023 at 12:02:58PM +0300, Mika Westerberg wrote:
On Wed, Oct 04, 2023 at 05:23:24PM -0500, Bjorn Helgaas wrote:
...
The thing with TUXEDO is that on Thomas's system with mem_sleep=deep
this patch (without denylist) breaks the resume as he describes here:
https://bugzilla.kernel.org/show_bug.cgi?id=216877
However, on exact same TUXEDO system with the same firmwares Werner is
not able to reproduce the issue with or without the patch. So I'm not
sure what to do and that's why I added denylist that should take effect
on Thomas' system when mem_sleep=deep is set but also work on the same
system without it.
And since we have the denylist, I decided to add the ASUS there to avoid
accidentally breaking those too.
...
I think there's still something we're missing.
We restore the LTR config before restoring DEVCTL2 (including the LTR
enable bit) and L1SS state. I don't think we know the state of ASPM
and L1SS at that point, do we? Do you think there could be an issue
there, too?
AFAICT LTR does not affect until it is explicitly enabled and I don't
think many drivers actually program it (although we have some sort of
API for it at least for Intel LPSS devices).
I couldn't find anything in the spec that suggests LTR should be an
issue. I'm just grasping at straws here.
There's obviously *something* we're doing wrong because ASPM worked
before suspend, so we should be able to get it to work after resume.
Could we learn anything by dumping config space of the problem devices
before/after the suspend/resume and comparing them?
If we could figure out a difference between Werner's working TUXEDO
and Thomas' non-working TUXEDO, that might be a hint, too.
If you have suggestions, I'm all open. If I understand you would like
this to be done like:
- Drop the denylist
- Add back the suspend/restore of L1SS
- Ask everyone in this thread to try it out
I can do that no problem but I guess that for the TUXEDO one (Thomas')
it probably is going to fail still.
Right, without the denylist, I expect Thomas' TUXEDO to fail, but I
still hope we can figure out why. If we just keep it on the denylist,
that system will suffer from more power consumption than necessary,
but only after suspend/resume.
A denylist seems like the absolute last resort. In this case we don't
know about anything *wrong* with those platforms; all we know is that
our resume path doesn't work. It's likely that it fails on other
platforms we haven't heard about, too.
Bjorn
The best guess from Mika and David was a firmware issue, but I run the
same Firmware revision as Werner. I even reflashed the Firmware, but
that did not change anything:
Quoting David Box:
> I agree that we should pursue an exception for your system. This is
> looking like a firmware bug. One thing we did notice in the turbostat
> results is your IRTL (Interrupt Response Time Limit) values are bogus:
>
> cpu6: MSR_PKGC3_IRTL: 0x0000884e (valid, 79872 ns)
> cpu6: MSR_PKGC6_IRTL: 0x00008000 (valid, 0 ns)
> cpu6: MSR_PKGC7_IRTL: 0x00008000 (valid, 0 ns)
> cpu6: MSR_PKGC8_IRTL: 0x00008000 (valid, 0 ns)
> cpu6: MSR_PKGC9_IRTL: 0x00008000 (valid, 0 ns)
> cpu6: MSR_PKGC10_IRTL: 0x00008000 (valid, 0 ns)
>
> This is despite the PKGC configuration register showing that all
> states are enabled:
>
> cpu6: MSR_PKG_CST_CONFIG_CONTROL: 0x1e008008 (UNdemote-C3,
UNdemote-C1, demote-
C3, demote-C1, locked, pkg-cstate-limit=8 (unlimited))
>
> Firmware sets this.
Since I can't currently flash modified firmware on this computer, can
those values be overridden from userspace?