Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Sep 27, 2023 at 12:24:55PM -0500, Bjorn Helgaas wrote:
> On Wed, Sep 27, 2023 at 08:01:14PM +0300, Mika Westerberg wrote:
> > On Wed, Sep 27, 2023 at 11:50:36AM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 27, 2023 at 03:47:32PM +0300, Mika Westerberg wrote:
> > > > On Wed, Sep 27, 2023 at 06:57:03AM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Sep 27, 2023 at 08:16:02AM +0300, Mika Westerberg wrote:
> > > > > > On Tue, Sep 26, 2023 at 12:55:30PM -0500, Bjorn Helgaas wrote:
> > > > > > > On Mon, Sep 25, 2023 at 04:19:30PM +0200, Lukas Wunner wrote:
> > > > > > > > On Mon, Sep 25, 2023 at 08:48:41AM -0500, Bjorn Helgaas wrote:
> > > > > > > > > Now pciehp thinks the slot is occupied and the link is up, so we
> > > > > > > > > re-enumerate the hierarchy.  Is this because thunderbolt did something
> > > > > > > > > to 06:00.0 that made the link from 05:01.0 come up?
> > > > > > > > 
> > > > > > > > PCIe TLPs are encapsulated into Thunderbolt packets and transmitted
> > > > > > > > alongside DisplayPort and other data over the same physical link.
> > > > > > > > 
> > > > > > > > For this to work, PCIe tunnels need to be set up between the Thunderbolt
> > > > > > > > host controller and attached devices.  Once a tunnel is established,
> > > > > > > > the PCIe link magically goes up and TLPs can be transmitted.
> > > > > > > > 
> > > > > > > > There are two ways to establish those tunnels:
> > > > > > > > 
> > > > > > > > 1/ By a firmware in the Thunderbolt host controller.
> > > > > > > >    (firmware or "internal" connection manager, drivers/thunderbolt/icm.c)
> > > > > > > > 
> > > > > > > > 2/ Natively by the kernel.
> > > > > > > >    (software connection manager)
> > > > > > > > 
> > > > > > > > I'm assuming that the laptop in question exclusively uses the firmware
> > > > > > > > connection manager, hence the kernel is reliant on that firmware to
> > > > > > > > establish tunnels and can't really do anything if it fails to do so.
> > > > > > > 
> > > > > > > Thanks for the background; that improves my meager understanding a
> > > > > > > lot.
> > > > > > > 
> > > > > > > Since this seems to be a firmware issue, it does sound like this
> > > > > > > laptop uses a firmware connection manager.  But there still seems to
> > > > > > > be some kernel connection because pre-e8b908146d44, the link came up
> > > > > > > in <5 seconds, and after the minor e8b908146d44 change, it takes >60
> > > > > > > seconds.
> > > > > > 
> > > > > > In both cases (with or without) the commit what happens is that after
> > > > > > resume is finished the firmware connection manager notices the
> > > > > > connection, announces it to the Thunderbolt driver that exposes it to
> > > > > > the userspace where boltd re-authorizes the device. This brings up the
> > > > > > PCIe tunnel again and things get working.
> > > > > > 
> > > > > > (What is expected to happen is that during the resume the firmware
> > > > > >  connection manager re-connects the PCIe tunnel.)
> > > > > > 
> > > > > > This took previously the ~5s before resume is complete so that the above
> > > > > > steps can happen where as after the commit it got delayed more up to the
> > > > > > arbitrary ~60s because we started to use that with the commit
> > > > > > e8b908146d44 (PCIE_RESET_READY_POLL_MS).
> > > > > 
> > > > > Why does the kernel delay affect the timing of when the firmware
> > > > > connection manager notices the connection?  It seems like Linux waits
> > > > > for the timeout, then Linux does something that kicks the firmware
> > > > > connection manager.  That's why I asked about this sequence:
> > > > > 
> > > > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > > > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > > > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > > > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > > > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > > > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > > > 
> > > > > where we wait for the timeout, decide the device is gone, remove
> > > > > everything, and then the thunderbolt driver does something, and we
> > > > > notice the device is magically back.
> > > > 
> > > > Well the delay delays the whole resume and this includes Thunderbolt
> > > > driver resume too, and userspace (where the bolt daemon authorizes the
> > > > device again).
> > > 
> > > I don't know how the Thunderbolt driver works.  I assume this refers
> > > to "thunderbolt 0000:06:00.0"?  Is the 06:00.0 resume related to the
> > > firmware connection manager somehow?
> > > 
> > > The removal affects the sub-hierarchy below 05:01.0 (bus 07-3b).
> > > 06:00.0 is below 05:00.0, so it's in a different sub-hierarchy.  I
> > > don't think there's a PCIe requirement that 05:01.0 be resumed before
> > > 05:00.0, or even that they be serialized at all.
> > > 
> > > The hierarchy:
> > > 
> > >   pci 0000:00:1c.4: PCI bridge to [bus 04-3c]
> > >   pci 0000:04:00.0: PCI bridge to [bus 05-3c]
> > >   pci 0000:05:00.0: PCI bridge to [bus 06]
> > >   pci 0000:05:01.0: PCI bridge to [bus 07-3b]
> > > 
> > > It looks like we start the 06:00.0 resume first (118.9), but it
> > > doesn't complete until after the timeout (191.7):
> > > 
> > >   [  118.915870] thunderbolt 0000:06:00.0: control channel starting...
> > >   [  118.985530] pcieport 0000:05:01.0: Data Link Layer Link Active not set in 1000 msec
> > >   [  190.090902] pcieport 0000:05:01.0: pciehp: Slot(1): Card not present
> > >   [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > >   [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > >   [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > >   [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> > > 
> > > Did the Thunderbolt driver do something to 06:00.0 that caused the
> > > 05:01.0 link to come up, or is the timing just coincidental?
> > 
> > Yes it sent the firmware a command telling that the driver is ready
> > again, then the firmware sends back notification that there is a new
> > device:
> > 
> > [  191.754347] thunderbolt 0000:06:00.0: 1: DROM version: 1
> > [  191.762638] thunderbolt 0-1: new device found, vendor=0x108 device=0x1630
> > [  191.762641] thunderbolt 0-1: Lenovo ThinkPad Thunderbolt 3 Dock
> > 
> > this then is send to the userspace via uevent where bolt goes and
> > authorizes it and this results the tunnel to be created which show in
> > the log as:
> > 
> > [  191.943506] pcieport 0000:05:01.0: pciehp: Slot(1): Card present
> 
> So the obvious next question is why we have to wait for the 05:01.0
> link timeout before sending the command to the 06:00.0 firmware, since
> there's no PCI connection between them.

No there is not - it is a PCIe tunnel.

> But there must be *some* connection between the 05:01.0 link coming up
> and the 06:00.0 behavior.  Maybe this is related to the
> nhi_resume_noirq() comment about "The tunneled pci bridges are
> siblings of us. Use resume_noirq to reenable the tunnels asap. A
> corresponding pci quirk blocks the downstream bridges resume_noirq
> until we are done." Unfortunately the comment doesn't mention the NAME
> of the quirk, so I lost the trail there.
> 
> Maybe there's an opportunity for a quirk that says "this Thunderbolt
> device should never need a whole second for the link to come up", for
> example.

The PCIe stack needs to follow PCIe spec (regardless of what is the
actual medium the packets go through). And regadless what we do here the
PCIe link goes down and that breaks the usage of any devices behind that
link such as mounted disks so the delay is definitely not the worst
thing that can happen.

Now, If you read anything that Kamil said, this actually works now with
the "secure" mode when he turned the "Wake from Thunderbolt" option back
to the default. So now it works as expected but upon unplug there is the
issue that we fixed with the commit that marks the devices disconnected.
He will test that next week so we can be sure.



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux