On Wed, Sep 27, 2023 at 09:02:00PM +0300, Mika Westerberg wrote: > On Wed, Sep 27, 2023 at 12:24:55PM -0500, Bjorn Helgaas wrote: > > On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote: > > > On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote: > > > ... > > > > The hierarchy: > > > > > > > > pci 0000:00:1c.4: PCI bridge to [bus 04-3c] > > > > pci 0000:04:00.0: PCI bridge to [bus 05-3c] > > > > pci 0000:05:00.0: PCI bridge to [bus 06] > > > > pci 0000:05:01.0: PCI bridge to [bus 07-3b] > > > > > > > > It looks like we start the 06:00.0 resume first (118.9), but it > > > > doesn't complete until after the timeout (191.7): > > > > > > > > [ 118.915870] thunderbolt 0000:06:00.0: control channel starting... > > > > [ 118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec > > > > [ 190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present > > > > [ 191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1 > > > > [ 191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630 > > > > [ 191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock > > > > [ 191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present > > > > > > > > Did the Thunderbolt driver do something to 06:00.0 that caused the > > > > 05:01.0 link to come up, or is the timing just coincidental? > > > > > > Yes it sent the firmware a command telling that the driver is ready > > > again, then the firmware sends back notification that there is a new > > > device: > > > > > > [ 191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1 > > > [ 191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630 > > > [ 191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock > > > > > > this then is send to the userspace via uevent where bolt goes and > > > authorizes it and this results the tunnel to be created which show in > > > the log as: > > > > > > [ 191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present > > > > So the obvious next question is why we have to wait for the 05:01.0 > > link timeout before sending the command to the 06:00.0 firmware, since > > there's no PCI connection between them. > > No there is not - it is a PCIe tunnel. You confirmed my assertion that in terms of the PCI topology, there is no connection between 05:01.0 and 06:00.0. So the question remains: do we need to wait for the 05:01.0 link timeout before the 06:00.0 driver does something that leads to firmware starting the process of bringing up the tunnel and the link? > > But there must be *some* connection between the 05:01.0 link coming up > > and the 06:00.0 behavior. Maybe this is related to the > > nhi_resume_noirq() comment about "The tunneled pci bridges are > > siblings of us. Use resume_noirq to reenable the tunnels asap. A > > corresponding pci quirk blocks the downstream bridges resume_noirq > > until we are done." Unfortunately the comment doesn't mention the NAME > > of the quirk, so I lost the trail there. > > > > Maybe there's an opportunity for a quirk that says "this Thunderbolt > > device should never need a whole second for the link to come up", for > > example. > > The PCIe stack needs to follow PCIe spec (regardless of what is the > actual medium the packets go through). And regadless what we do here the > PCIe link goes down and that breaks the usage of any devices behind that > link such as mounted disks so the delay is definitely not the worst > thing that can happen. Sure. But if we resume in a reasonable time, we at least have the opportunity to notice what happened and warn about why the mounted disk broke. That's way better than getting bug reports that say "resume is broken." > Now, If you read anything that Kamil said, this actually works now with > the "secure" mode when he turned the "Wake from Thunderbolt" option back > to the default. So now it works as expected but upon unplug there is the > issue that we fixed with the commit that marks the devices disconnected. > He will test that next week so we can be sure. I did see that. You may have also seen my sentiment that it's painful for users to diagnose an apparent resume failure and find a BIOS option to fix it. If we can't make resume work in a reasonable time, can we at least detect the case where it's going to take too long? If so, maybe we can warn about it, avoid suspend, or change the way we do it. Bjorn