Re: [REGRESSION] resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"

On Sun, Sep 24, 2023 at 03:18:00PM -0500, Bjorn Helgaas wrote:
> On Sun, Sep 24, 2023 at 04:27:22PM +0300, Mika Westerberg wrote:
> > On Sat, Sep 23, 2023 at 05:46:10PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 25, 2023 at 10:42:55AM +0200, Kamil Paral wrote:
> > > > On Thu, Aug 24, 2023 at 1:43 PM Mika Westerberg
> > > > <mika.westerberg@xxxxxxxxxxxxxxx> wrote:
> > > > > One thing I noticed, probably has nothing to do with this, but you have
> > > > > the "security level" set to "secure". Now this is fine and actually
> > > > > recommended but I wonder if anything changes if you switch that
> > > > > temporarily to "user"? What is happening here is that once the system
> > > > > enters S3 the Thunderbolt driver tells the firmware to save the
> > > > > connected device list, and then once it exits S3 it is expected to
> > > > > re-connect the PCIe tunnels of the devices on that list but this is not
> > > > > happening and that's why the dock "disappears" during resume.
> > > > 
> > > > That was a great suggestion. After switching to the user security
> > > > level, the resume delay is gone, and my dock devices seem to be
> > > > working almost immediately after resume! The dmesg for that is here:
> > > > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1985262
> > > > 
> > > > I've done tens of cycles and haven't found any race conditions, unlike
> > > > with the TB assist mode. (Only once, my USB mouse wasn't working at
> > > > all, but that's something that occasionally happens on most docks I've
> > > > worked with and seems to be some different issue).
> > > > 
> > > > I'm sorry I haven't found this earlier myself. I did try switching
> > > > these options, but I bundled it together with enabling the TB assist
> > > > mode, which has quirks, so I didn't realize switching just this one
> > > > option might have an impact.
> > > > 
> > > > > In any case we can conclude that the commit in question has nothing to
> > > > > do with the issue. This is a completely Thunderbolt-related problem.
> > > > 
> > > > Considering the information above, does this appear to be a solely
> > > > dock-related issue (bugged firmware), or does it make sense to follow
> > > > up on this in some different kernel list? I have to say I'm completely
> > > > OK with running the laptop using the "user" TB security level, but if
> > > > you think I should follow up somewhere to get the "secure" level fixed
> > > > (or some workaround applied, etc), I can.
> > > 
> > > I'm confused about this issue.  Correct me if I go wrong:
> > > 
> > > The hierarchy is:
> > > 
> > >   00:1c.4 Root Port to [bus 04-3c]
> > >   04:00.0 Upstream Port (Thunderbolt) to [bus 05-3c]
> > >   05:01.0 Downstream Port (Thunderbolt) to [bus 07-3b]
> > >   07:00.0 Upstream Port (Thunderbolt) to [bus 08-3b]
> > > 
> > > With security level=secure, before e8b908146d44 ("PCI/PM: Increase
> > > wait time after resume"), resume takes ~5 seconds, but the hierarchy
> > > below 05:01.0 gets removed and re-enumerated (dmesg [1]).  After
> > > e8b908146d44, the same thing happens except the resume takes 60+
> > > seconds (dmesg [2]).  In both cases, the devices (USB mouse, LAN, etc)
> > > below 05:01.0 work after resume.
> > > 
> > > With security level=user, resume takes << 5 seconds regardless of
> > > e8b908146d44, and the hierarchy below 05:01.0 does not get removed and
> > > re-enumerated (dmesg [3]).
> > > 
> > > So if that's all accurate, it sounds like we've always had some
> > > problem with security level=secure that causes the hierarchy to get
> > > removed and re-enumerated, and e8b908146d44 just makes this problem
> > > much more visible?
> > 
> > Yes.
> > 
> > > I don't know anything at all about how Thunderbolt security levels
> > > work.  If "secure" means the hierarchy must be re-enumerated after
> > > resume, we can detect that case immediately and get on with it without
> > > having to wait for a timeout?
> > 
> > "secure" means that the Thunderbolt device that is connected can "prove"
> > it is the device we "authorized". It basically has a random number we
> > generated flashed on the NVM. This is the security "measure" used before
> > PC world aligned to use IOMMU instead.
> > 
> > (there is an explanation of all these here:
> > https://docs.kernel.org/admin-guide/thunderbolt.html#security-levels-and-how-to-use-them).
> 
> So is there some user-level software that runs between the removal and
> re-enumeration?  Something that authorizes the 07:00.0 Upstream Port?

No.

They get "authorized" upon plug by boltd based on the user's decision,
and after that the firmware should keep them authorized as long as the
device is connected, including across resume.
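
For anyone who wants to check this state on their own machine, here is
a minimal sketch that reads it from sysfs. It assumes the attributes
described in the admin guide linked above (a "security" file per domain
and an "authorized" file per device under /sys/bus/thunderbolt/devices).
The helper names read_attr() and thunderbolt_state() are made up for
illustration and are not part of boltd or the kernel.

```python
from pathlib import Path

# Assumed default sysfs location of the Thunderbolt bus.
TB_SYSFS = Path("/sys/bus/thunderbolt/devices")


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def thunderbolt_state(root=TB_SYSFS):
    """Collect domain security levels and per-device 'authorized' values.

    Returns (domains, devices): two dicts keyed by the sysfs entry name.
    """
    root = Path(root)
    domains, devices = {}, {}
    if not root.is_dir():
        return domains, devices
    for entry in sorted(root.iterdir()):
        # Domains expose 'security'; devices expose 'authorized'.
        security = read_attr(entry / "security")
        if security is not None:
            domains[entry.name] = security
        authorized = read_attr(entry / "authorized")
        if authorized is not None:
            devices[entry.name] = authorized
    return domains, devices


if __name__ == "__main__":
    domains, devices = thunderbolt_state()
    for name, level in domains.items():
        print(f"{name}: security={level}")
    for name, value in devices.items():
        print(f"{name}: authorized={value}")
```

Running it once before suspend and once after resume shows whether the
dock is still listed with the same "authorized" value. It only reports
the state the driver exposes; it does not show whether the firmware
re-established the PCIe tunnel in time.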

> > Now, in case of resume the Thunderbolt firmware is expected to connect
> > the PCIe tunnel before the OS gets to resume its PCIe stack but this is
> > not happening in this particular system when the security level is set
> > to "secure". It could be a firmware issue, and also if the BIOS settings
> > get changed from the defaults it is entirely possible that the system
> > enters paths that are not fully validated. Yes, changing security level
> > should definitely work and the PCIe tunnel should be properly
> > established but in any case this is a Thunderbolt issue, not a PCIe issue.
> > 
> > I don't recall if I suggested this already but if not, try to see if there
> > is a firmware update for your system. Lenovo supports LVFS so if there
> > is newer one fwupd should allow you to upgrade it.
> 
> I think you have suggested a firmware update, but I don't think that's
> a great solution for most users.  An ordinary user who has the
> security level set to "secure" and updates to a v6.4 kernel is just
> going to think resume is broken.  That user will not be willing or
> able to diagnose it as a security setting that could be changed or
> firmware that could be updated.

It does not affect every single system out there with the security level
set to "secure". It is just this one AFAICT. Like I said, the firmware is
expected to connect the PCIe tunnel (and for some unknown reason in this
particular system it does not).

> It would be ideal if we could make "secure" resume as fast as "user"
> resume, but at the very least, I think we need to make it no worse
> than it was in v6.3.

Well, what else can we do here? The link goes down regardless of what we
do in the PCI stack. If you want to revert the patch that caused the
delay, fine, but that does not cure the problem that the device stack
gets torn down upon resume. For instance, if there is a USB stick
connected to the dock and the filesystem is mounted, all that is gone
upon resume regardless of the delay. If you want to detect "secure" vs.
"user", fine, feel free to do so, but keep in mind that there are other
systems out there where this works just fine, so avoid breaking them.


