Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Aug 21, 2023 at 12:39:02PM +0200, Kamil Paral wrote:
> = Summary =
> A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected
> can no longer resume from suspend. The problem was introduced in
> e8b908146d44 "PCI/PM: Increase wait time after resume".
> 
> = Detailed description =
> When running a kernel containing the identified commit and trying to
> resume the laptop from sleep, the laptop's power light changes from
> blinking (sleep state) to shining (running state), but the display
> stays black and it doesn't respond to any keyboard input, nor to
> ping/ssh, and no logs are written to the disk (which means I don't
> know how to gather error logs). It needs to be force-rebooted. I
> bisected the kernel and identified the commit which causes this
> behavior. I used the vanilla kernel with a Fedora kernel config.
> 
> The reproducer is:
> 1. Connect the dock to the laptop.
> 2. Boot the laptop (in my case, to the gdm).
> 3. Suspend the laptop.
> 4. Resume the laptop. This is successful before the identified commit
> (the last tested good commit was cc8a983d0fce), and unsuccessful
> (black screen, frozen system) after the identified commit
> (e8b908146d44).
> 
> The reproducibility is 100%, I tested it many many times in a row.
> 
> When the dock is unplugged, suspend and resume works as expected. When
> I connect a different laptop to the dock (Thinkpad P1 gen3), I don't
> see any resume failure. So this is somehow related to the particular
> combination of Thinkpad T480s and Thinkpad Thunderbolt 3 Dock. The
> dock is running the latest firmware.

Thanks very much for the report.

Wow, this is super interesting.  e8b908146d44 literally just increases
a timeout; the complete patch is:

   static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev)
   {
  -       pci_bridge_wait_for_secondary_bus(pci_dev, "resume", PCI_RESET_WAIT);
  +       pci_bridge_wait_for_secondary_bus(pci_dev, "resume",
  +                                         PCIE_RESET_READY_POLL_MS);

Increasing a timeout should never cause a failure like this, so
there must be something really unexpected going on.

> I also tested "pcie_aspm=off",
> and that allows the laptop to resume properly, but then there's a race
> condition whether devices on the dock are visible to the OS or not
> after resume, so this is not useful even just as a workaround.

Wow, also shocking.  I see you have lspci output attached to the RH
bugzilla, but it doesn't include the details.  Would you mind
collecting the output of "sudo lspci -vv" both with and without
"pcie_aspm=off"?  No need to try suspend/resume to collect these.

Also, what does this race condition look like?  Dock devices are
visible before suspend, but sometimes none of them are visible *after*
resume?  We don't re-enumerate on resume, so does this mean they still
appear in lspci output but they just don't work?

If you can collect the complete dmesg log and "sudo lspci -vv" output
after a resume when the dock devices aren't visible, maybe there would
be a clue there.

> I already created a downstream Fedora bug report in Red Hat Bugzilla:
> https://bugzilla.redhat.com/show_bug.cgi?id=2230357
> 
> lspci of the laptop:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1982541
> git bisect log:
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1983351
> 
> 
> The commit which broke resume is the following:
> 
> e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
> commit e8b908146d44310473e43b3382eca126e12d279c
> Author: Mika Westerberg <mika.westerberg.com>
> Date:   Tue Apr 4 08:27:13 2023 +0300
> 
>     PCI/PM: Increase wait time after resume
> 
>     PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
>     config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
>     reset and this same delay is prescribed when coming out of D3cold (as that
>     involves reset too).
> 
>     A device that requires more than 1 second to initialize after reset may
>     respond to config requests with Request Retry Status completions (sec
>     2.3.1), and we accommodate that in Linux with a 60 second cap
>     (PCIE_RESET_READY_POLL_MS).
> 
>     Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
>     path, not in the resume path.  However, a device has surfaced, namely Intel
>     Titan Ridge xHCI, which requires a longer delay also in the resume code
>     path.
> 
>     Make the resume code path to use this same extended delay as the reset
>     path.
> 
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
>     Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@xxxxxxxxxxxxxxx
>     Reported-by: Chris Chiu <chris.chiu>
>     Signed-off-by: Mika Westerberg <mika.westerberg.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas>
>     Cc: Lukas Wunner <lukas>
> 
>  drivers/pci/pci-driver.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> 
> I'm happy to add further details, perform additional debugging, or
> test some experimental patches in order to resolve this regression.
> Please CC me in your replies, I'm not subscribed to this list. Thank
> you!
> Kamil Páral
> 
> 
> #regzbot introduced: e8b908146d44
> 



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux