On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > re-enumerate the hierarchy. Is this because thunderbolt did something
> > > to 06:00.0 that made the link from 05:01.0 come up?
> >
> > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > alongside DisplayPort and other data over the same physical link.
> >
> > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > host controller and attached devices. Once a tunnel is established,
> > the PCIe link magically goes up and TLPs can be transmitted.
> >
> > There are two ways to establish those tunnels:
> >
> > 1/ By a firmware in the Thunderbolt host controller.
> >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> >
> > 2/ Natively by the kernel.
> >    (software connection manager)
> >
> > I'm assuming that the laptop in question exclusively uses the firmware
> > connection manager, hence the kernel is reliant on that firmware to
> > establish tunnels and can't really do anything if it fails to do so.
>
> Thanks for the background; that improves my meager understanding a
> lot.
>
> Since this seems to be a firmware issue, it does sound like this
> laptop uses a firmware connection manager. But there still seems to
> be some kernel connection because pre-e8b908146d44, the link came up
> in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> seconds.

In both cases (with or without the commit), what happens is that after
resume has finished the firmware connection manager notices the
connection and announces it to the Thunderbolt driver, which exposes it
to userspace, where boltd re-authorizes the device. This brings up the
PCIe tunnel again and things start working.
(What is expected to happen is that the firmware connection manager
re-connects the PCIe tunnel during resume.) Previously it took ~5s
before resume completed and the above steps could happen, whereas after
commit e8b908146d44 this is delayed further, up to the arbitrary ~60s,
because with that commit we started to use PCIE_RESET_READY_POLL_MS.

> I'm kind of at a loss here because I don't have a clear path forward.
> What I'm hearing is that the real fix is a firmware update or a BIOS
> setting change (Thunderbolt "user" instead of "secure" mode).

There are lots of firmwares involved, so if any of their settings is
changed from the default value the system may unfortunately enter code
paths that are not fully validated.

I would also try changing all the BIOS settings back to their defaults,
check that it works (it is probably at the "user" security level then),
then switch back to "secure" (changing only this one option) and see
whether it works now. It could be that some setting just did not get
committed properly.

> That is problematic for users, who will think resume got broken and
> they don't know how to fix it. It's problematic for me, because it
> *looks* like a PCI issue and a PCI change exposed it, so I'll have to
> deal with the reports.

I'm sorry about that. I'm trying my best to remedy this.
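
To illustrate the timing described above, here is a rough Python model
(not kernel code) of a resume path that polls for the device until a
timeout expires. The function name resume_wait_ms, the 100 ms poll
interval and the two timeout values are assumptions made up for this
sketch; 5000 and 60000 only mirror the ~5s and ~60s figures mentioned
in this thread. The point it tries to show is that if the tunnel only
comes back after resume has finished, the wait always runs to the full
timeout, so a longer timeout translates directly into a longer resume.

```python
def resume_wait_ms(timeout_ms, ready_at_ms=None, poll_ms=100):
    """Model how long a resume path waits for a device (hypothetical helper).

    timeout_ms:  upper bound on the wait (assumed value, see callers below)
    ready_at_ms: time at which the device would start responding, or None
                 if it cannot respond until after resume has completed
    poll_ms:     assumed polling interval; time is simulated, nothing sleeps
    """
    waited = 0
    while waited < timeout_ms:
        if ready_at_ms is not None and waited >= ready_at_ms:
            return waited          # device answered, stop waiting early
        waited += poll_ms          # stands in for sleeping one interval
    return waited                  # timed out, device never answered


# The tunnel is only re-established after resume, so the device never
# answers during the wait and the whole timeout is spent.
short_wait = resume_wait_ms(5_000)     # illustrative "~5s" case
long_wait = resume_wait_ms(60_000)     # illustrative "~60s" case

# If the firmware did re-connect the tunnel during resume, the wait
# would end shortly after the device became ready, whatever the timeout.
early = resume_wait_ms(60_000, ready_at_ms=250)
```

In this model the larger timeout only hurts when the device cannot
become ready during resume, which matches the behaviour reported here.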