Re: [Regression] [PCI/ASPM] [ASUS PN51] Reboot on resume attempt (bisect done; commit found)

On Fri, 5 Jan 2024, Kai-Heng Feng wrote:

> On Wed, Jan 3, 2024 at 11:41 PM Ilpo Järvinen
> <ilpo.jarvinen@xxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, 1 Jan 2024, Bjorn Helgaas wrote:
> >
> > > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> > > > Issue:
> > > > On resume from suspend to RAM there is no output for about 12 seconds, then
> > > > shortly a blinking cursor is visible in the upper left corner on an
> > > > otherwise black screen which is followed by a reboot.
> > > >
> > > > Setup:
> > > > * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> > > > * Firmware: 0508 (latest; also tested previous 0505)
> > > > * OS: Ubuntu 23.10 (except kernel)
> > > > * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> > > >
> > > > Debugging summary:
> > > > * Kernel 5.10.205 isn’t affected.
> > > > * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> > > > cause.
> > > > * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> > > > ASPM is enabled (default).
> > > > * The commit message indicates that a quirk could be written to mitigate the
> > > > issue but I don’t know how to write such a quirk.
> > > >
> > > > Confirmed workarounds:
> > > > * Connect a USB flash drive (no clue why; maybe this causes a delay that
> > > > lets the resume succeed)
> > > > * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> > > > intentional; a quirk seems to be the preferred solution)
> > > > * pcie_aspm=off
> > > > * pcie_aspm.policy=performance
> > > > * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
> > > >
> > > > Debugging details:
> > > > * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> > > > any difference.
> > > > * Double checked that the kernel is configured to *not* reboot on panic.
> > > > * Double checked that there still isn't any kernel output without quiet and
> > > > splash.
> > > > * The issue doesn’t happen if a USB flash drive is connected. The content of
> > > > the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> > > > matter.
> > > > * No information in any logs after the reboot. I suspect the resume from
> > > > suspend to RAM isn’t getting far enough for logs to be written.
> > > > * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> > > > affected.
> > > > * A kernel bisect has revealed the following commit as cause:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> > > > * The commit was part of kernel 5.20 and has been backported to 5.15.
> > > > * The commit mentions that a device-specific quirk could be added in case of
> > > > new issues.
> > > > * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> > > > has ASPM enabled by default.
> > > > * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> > > > succeed.
> > > > * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> > > > suspend to RAM succeed.
> > > > * This would indicate that a quirk is missing for the device 0000:03:00.0
> > > > (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> > > > the specifics for such a quirk.
> > > > * I still have no clue how a USB flash drive plays into all this. Maybe some
> > > > kind of a timing issue where the connected USB flash drive delays something
> > > > long enough so that the resume succeeds. Maybe the code removed by commit
> > > > 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
> > >
> > > Hmmm.  08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > > appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> > >
> > > But I think the best option is to revert it until this issue is
> > > resolved.  Per the commit log, 08d0cc5f3426 solved two problems:
> > >
> > >   1) ASPM config changes done via sysfs are lost if the device power
> > >      state is changed, e.g., typically set to D3hot in .suspend() and
> > >      D0 in .resume().
> > >
> > >   2) If L1SS is restored during system resume, that restored state
> > >      would be overwritten.
> > >
> > > Problem 2) relates to a patch that is currently reverted (a7152be79b62
> > > ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> > > suspend/resume"")), so I don't think reverting 08d0cc5f3426 will make
> > > this problem worse.
> > >
> > > Reverting 08d0cc5f3426 will make 1) a problem again.  But my guess is
> > > ASPM changes via sysfs are fairly unusual and the device probably
> > > remains functional even though it may use more power because the ASPM
> > > configuration was lost.
> > >
> > > So unless somebody has a counter-argument, I plan to queue a revert of
> > > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > > v6.7.
> >
> > Hi,
> >
> > I cannot understand how 1) even occurs. AFAICT, nothing that
> > pcie_aspm_pm_state_change() calls into overwrites link->aspm_disable,
> > which is the variable storing user input from sysfs. So how are the
> > changes made via sysfs lost?
> 
> Because it's the states being enabled via sysfs that get overwritten,
> not the disabled ones.

This leaves me even less sure what you're talking about here. Are we
talking about aspm_attr_store_common(), which "enables" states by changing
link->aspm_disable? (But aspm_attr_store_common() just as much "disables"
states by altering link->aspm_disable, so I don't see how there's a
difference between enabled and disabled ones.)

During pcie_aspm_pm_state_change(), pcie_config_aspm_link() then uses 
link->aspm_capable and link->aspm_disable as input but it won't change 
link->aspm_disable (= it won't overwrite the user's input).
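
To make the point concrete, here is a minimal user-space sketch of how I
read that flow. It is a simplified model, not the kernel code: the struct,
the ST_* bit values, and the model_*() helpers are made up for
illustration, and the requested state is assumed to be "everything the
policy allows". Only the use of aspm_capable and aspm_disable as inputs
follows the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values chosen for this model only; they are not the kernel's masks. */
#define ST_L0S 0x1u
#define ST_L1  0x2u
#define ST_ALL (ST_L0S | ST_L1)

/* Hypothetical, cut-down stand-in for struct pcie_link_state. */
struct link_model {
	uint32_t aspm_capable;  /* states the link can support */
	uint32_t aspm_disable;  /* states the user disabled via sysfs */
	uint32_t aspm_enabled;  /* states currently programmed */
};

/* Stand-in for pcie_config_aspm_link(): reads both masks, writes neither. */
static void model_config_link(struct link_model *link, uint32_t requested)
{
	link->aspm_enabled = requested & link->aspm_capable & ~link->aspm_disable;
}

/* Stand-in for aspm_attr_store_common(): both directions edit aspm_disable. */
static void model_attr_store(struct link_model *link, uint32_t state, int enable)
{
	if (enable)
		link->aspm_disable &= ~state;
	else
		link->aspm_disable |= state;
	model_config_link(link, ST_ALL);
}

/* Stand-in for the reconfiguration done on a power state change. */
static void model_pm_state_change(struct link_model *link)
{
	model_config_link(link, ST_ALL);
}

/* Self-check: the user's input survives a power state change either way. */
static void model_selfcheck(void)
{
	struct link_model link = { ST_ALL, ST_L1, 0 };

	model_attr_store(&link, ST_L1, 1);      /* user enables L1 */
	model_pm_state_change(&link);
	assert(link.aspm_disable == 0);
	assert(link.aspm_enabled == ST_ALL);

	model_attr_store(&link, ST_L0S, 0);     /* user disables L0s */
	model_pm_state_change(&link);
	assert(link.aspm_disable == ST_L0S);
	assert(link.aspm_enabled == ST_L1);
}
```

In this model, calling model_selfcheck() from a small main() passes for
both the "enable" and the "disable" case, which is why I don't see where
the two would differ.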

pcie_update_aspm_capable(), done before calling pcie_config_aspm_path(), can
alter link->aspm_capable (which looks very much intentional), and that can
lead to some state no longer being available to pcie_config_aspm_link().
Is this what you are trying to say, that some state gets removed from
link->aspm_capable and the effective result is that a state the user enabled
via sysfs can no longer be enabled?
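
A small sketch of that scenario, again as a simplified user-space model
rather than the kernel code. The struct, the ST_* bit values, and the
model_*() helpers are hypothetical, and why aspm_capable shrinks is not
modeled at all: the new mask is simply passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values chosen for this model only; they are not the kernel's masks. */
#define ST_L0S 0x1u
#define ST_L1  0x2u
#define ST_ALL (ST_L0S | ST_L1)

/* Hypothetical, cut-down stand-in for struct pcie_link_state. */
struct link_model {
	uint32_t aspm_capable;  /* states the link can support */
	uint32_t aspm_disable;  /* states the user disabled via sysfs */
	uint32_t aspm_enabled;  /* states currently programmed */
};

/* Stand-in for pcie_config_aspm_link(): reads both masks, writes neither. */
static void model_config_link(struct link_model *link, uint32_t requested)
{
	link->aspm_enabled = requested & link->aspm_capable & ~link->aspm_disable;
}

/* Stand-in for pcie_update_aspm_capable(); the new mask is an assumption. */
static void model_update_capable(struct link_model *link, uint32_t new_capable)
{
	link->aspm_capable = new_capable;
}

/* Power state change: refresh the capable mask, then reconfigure. */
static void model_pm_state_change(struct link_model *link, uint32_t new_capable)
{
	model_update_capable(link, new_capable);
	model_config_link(link, ST_ALL);
}

/* Self-check: L1 is lost although the user's input is untouched. */
static void model_selfcheck(void)
{
	/* User has nothing disabled; both states are currently enabled. */
	struct link_model link = { ST_ALL, 0, ST_ALL };

	model_pm_state_change(&link, ST_L0S);   /* L1 drops out of capable */
	assert(link.aspm_disable == 0);         /* user input preserved */
	assert(link.aspm_enabled == ST_L0S);    /* but L1 is no longer enabled */
}
```

If this is the case being described, the user's setting is not overwritten
as such; it just stops having an effect while the state is missing from
aspm_capable.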

I even looked into aspm.c from 08d0cc5f3426 but cannot see significant
differences in how link->aspm_disable is handled compared with the current
code.

-- 
 i.
