On Mon, Aug 21, 2023 at 12:39:02PM +0200, Kamil Paral wrote: > = Summary = > A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected > can no longer resume from suspend. The problem was introduced in > e8b908146d44 "PCI/PM: Increase wait time after resume". > > = Detailed description = > When running a kernel containing the identified commit and trying to > resume the laptop from sleep, the laptop's power light changes from > blinking (sleep state) to shining (running state), but the display > stays black and it doesn't respond to any keyboard input, nor to > ping/ssh, and no logs are written to the disk (which means I don't > know how to gather error logs). It needs to be force-rebooted. I > bisected the kernel and identified the commit which causes this > behavior. I used the vanilla kernel with a Fedora kernel config. > > The reproducer is: > 1. Connect the dock to the laptop. > 2. Boot the laptop (in my case, to the gdm). > 3. Suspend the laptop. > 4. Resume the laptop. This is successful before the identified commit > (the last tested good commit was cc8a983d0fce), and unsuccessful > (black screen, frozen system) after the identified commit > (e8b908146d44). > > The reproducibility is 100%, I tested it many many times in a row. > > When the dock is unplugged, suspend and resume works as expected. When > I connect a different laptop to the dock (Thinkpad P1 gen3), I don't > see any resume failure. So this is somehow related to the particular > combination of Thinkpad T480s and Thinkpad Thunderbolt 3 Dock. The > dock is running the latest firmware. Thanks very much for the report. Wow, this is super interesting. e8b908146d44 literally just increases a timeout; the complete patch is: static void pci_pm_bridge_power_up_actions(struct pci_dev *pci_dev) { - pci_bridge_wait_for_secondary_bus(pci_dev, "resume", PCI_RESET_WAIT); + pci_bridge_wait_for_secondary_bus(pci_dev, "resume", + PCIE_RESET_READY_POLL_MS); Increasing a timeout should never cause a failure like this, so there must be something really unexpected going on. > I also tested "pcie_aspm=off", > and that allows the laptop to resume properly, but then there's a race > condition whether devices on the dock are visible to the OS or not > after resume, so this is not useful even just as a workaround. Wow, also shocking. I see you have lspci output attached to the RH bugzilla, but it doesn't include the details. Would you mind collecting the output of "sudo lspci -vv" both with and without "pcie_aspm=off"? No need to try suspend/resume to collect these. Also, what does this race condition look like? Dock devices are visible before suspend, but sometimes none of them are visible *after* resume? We don't re-enumerate on resume, so does this mean they still appear in lspci output but they just don't work? If you can collect the complete dmesg log and "sudo lspci -vv" output after a resume when the dock devices aren't visible, maybe there would be a clue there. > I already created a downstream Fedora bug report in Red Hat Bugzilla: > https://bugzilla.redhat.com/show_bug.cgi?id=2230357 > > lspci of the laptop: > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1982541 > git bisect log: > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1983351 > > > The commit which broke resume is the following: > > e8b908146d44310473e43b3382eca126e12d279c is the first bad commit > commit e8b908146d44310473e43b3382eca126e12d279c > Author: Mika Westerberg <mika.westerberg.com> > Date: Tue Apr 4 08:27:13 2023 +0300 > > PCI/PM: Increase wait time after resume > > PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to > config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional > reset and this same delay is prescribed when coming out of D3cold (as that > involves reset too). > > A device that requires more than 1 second to initialize after reset may > respond to config requests with Request Retry Status completions (sec > 2.3.1), and we accommodate that in Linux with a 60 second cap > (PCIE_RESET_READY_POLL_MS). > > Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code > path, not in the resume path. However, a device has surfaced, namely Intel > Titan Ridge xHCI, which requires a longer delay also in the resume code > path. > > Make the resume code path to use this same extended delay as the reset > path. > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728 > Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@xxxxxxxxxxxxxxx > Reported-by: Chris Chiu <chris.chiu> > Signed-off-by: Mika Westerberg <mika.westerberg.com> > Signed-off-by: Bjorn Helgaas <bhelgaas> > Cc: Lukas Wunner <lukas> > > drivers/pci/pci-driver.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > > I'm happy to add further details, perform additional debugging, or > test some experimental patches in order to resolve this regression. > Please CC me in your replies, I'm not subscribed to this list. Thank > you! > Kamil Páral > > > #regzbot introduced: e8b908146d44 >