Hi,

On Wed, Oct 09, 2024 at 10:01:18PM -0500, Aaron Rainbolt wrote:
> We're experiencing a Linux kernel bug affecting multiple Clevo X370SNx1
> laptops (specifically the X370SNW1 variant). The bug appears to be
> present in kernels greater than or equal to 6.5, worsening
> significantly with kernel 6.11.2 (latest stable at time of this
> writing). It is unclear if all of the issues encountered are the same
> bug, however the primary problem we've run into appears to be a
> consequence of the power management code involving Intel Barlow Ridge
> controllers and DisplayPort. The issue occurs with in-kernel Nouveau
> drivers and also with proprietary NVIDIA drivers.
>
> When a DisplayPort monitor is attached to these laptops via a USB-C
> connection, the monitor is recognized by the system and comes on for
> approximately 15 seconds. It then blanks out and is automatically
> disconnected from the system as if it had been unplugged. It will
> remain that way indefinitely until unplugged and replugged, or until
> something "jiggles" (for lack of a better term) the thunderbolt driver.
> When either of these things occur, the display will re-attach and come
> back on for 15 seconds, then blank out and detach again. There are
> various different things that can "jiggle" the thunderbolt driver,
> including but not limited to:
>
> * Running `lspci -k` (this one came as a particular surprise)
> * Removing and re-inserting the thunderbolt driver (`sudo modprobe -r
>   thunderbolt; sleep 1; sudo modprobe thunderbolt`)
> * Running `nvidia-detector` while proprietary NVIDIA drivers are loaded

Or just disabling runtime PM, I presume.

> It is possible to mitigate this issue by simply running
> `sudo modprobe -r thunderbolt` or `sudo rmmod thunderbolt` and then
> leaving the driver unloaded. USB-C displays become stable after this -
> they are recognized when attached and remain recognized and functional
> indefinitely as one would expect.
>
> We believe this is related to the Intel Barlow Ridge USB4 controller
> because:
>
> * Removing the thunderbolt driver restores normal display operation.
> * This issue was *not* a problem on Clevo X370SNx machines, which are
>   identical to the X370SNx1 except for the Maple Ridge TBT controller
>   on the board has been replaced with a Barlow Ridge USB4 controller.
> * This problem does not occur on the affected models with the 6.1
>   kernel. It occurs with the 6.5 kernel and on all newer kernels we
>   have tried.
>
> Furthermore, from inspecting the Thunderbolt driver code, we believe
> this is related to the power management features of the driver, because:
>
> * There is only one 15-second timeout defined in the driver source
>   code, that being TB_AUTOSUSPEND_DELAY in drivers/thunderbolt/tb.h
> * On earlier kernels (Ubuntu’s variant of 6.8 at least), displays are
>   stable even when the thunderbolt driver is loaded if we:
>     * Remove the thunderbolt driver
>     * Attach a USB-C dock
>     * Attach displays to the dock (we used 2 4K HDMI monitors)
>     * Reload the thunderbolt driver
>
> During our investigation, we discovered commit
> a75e0684efe567ae5f6a8e91a8360c4c1773cf3a (patch on mailing list at
> https://lore.kernel.org/linux-usb/20240213114318.3023150-1-mika.westerberg@xxxxxxxxxxxxxxx/)
> which appears to be a fix for this exact problem. It adds a quirk for
> Intel Barlow Ridge controllers, which detects when a DisplayPort device
> has been plugged directly into the USB4 port (thus using "redrive"
> mode), and instructs the power management subsystem to not power the
> chip down during this time if so. Unfortunately, this quirk seems to be
> silently ignored, as we built a custom kernel with some `printk` lines
> added to the `tb_enter_redrive` and `tb_exit_redrive` functions to
> announce when they were called, and nothing in the dmesg log indicated
> that they had been called when we did this.
>
> This bug is easily reproducible using the stock kernels in Kubuntu
> 22.04, Kubuntu 24.04, Kali Linux 2024.2, and Fedora Workstation
> Rawhide. Similar behavior is observed across all of these distributions.
>
> We built the 6.11.2 kernel from source and tested it on Kubuntu 24.04,
> but while the kernel built, installed, and functioned properly in most
> respects, it actually made the problem with USB-C displays worse. As
> long as the thunderbolt driver was loaded, no displays were detected
> when plugged in (not for even a short length of time), and when the
> thunderbolt driver was unloaded, displays would only be recognized and
> function if there was only one display attached. Attaching a second
> display resulted in the first external display becoming detached and
> the second display not coming on. Unplugging the second display
> resulted in the first display reattaching. This machine supports up to
> three external displays and this has proven to be achievable and stable
> with earlier kernels. No valuable error messages were logged in dmesg
> when these problems occurred.
>
> Our testing has been limited to the Clevo X370SNW1 model, however we
> expect that the X370SNV1 model will exhibit the same issues as it uses
> very similar internal components on the system board.
>
> This is basically the extent of our knowledge at this point. We
> attempted various patches on Ubuntu's 6.8 kernel to resolve the issue,
> all without success:
>
> * We attempted reverting fd4d58d1fef9ae9b0ee235eaad73d2e0a6a73025
>   (thunderbolt: Enable CL2 low power state), which had no effect.
> * We noticed that one of the Barlow Ridge bridge controllers
>   listed by `lspci -k` appeared to not have its device ID in
>   drivers/thunderbolt/nhi.h and there was a corresponding quirk in
>   drivers/thunderbolt/quirks.c that looked like it might be vaguely
>   related to the issue (specifically quirk_usb3_maximum_bandwidth), so
>   we tried adding that device to the appropriate files in order to make
>   that quirk apply to that device as well, this had no visible effect
>   on the kernel's operation and did not resolve the issue.
> * After narrowing it down to `quirk_block_rpm_in_redrive`, we attempted
>   adding a new `thunderbolt.kf_force_redrive` kernel parameter in
>   drivers/thunderbolt/tb.c that forced the code in
>   `tb_enter_redrive` and `tb_exit_redrive` to be executed even *if* the
>   device didn't have the appropriate quirk bit set, in the hopes that
>   this would make the quirk execute and resolve the issue. What ended
>   up happening was somehow `tb_enter_redrive` was never called at all
>   and `tb_exit_redrive` was called. This in turn made it so that no
>   USB-C displays would even be recognized for a short period of time if
>   the thunderbolt driver was loaded.
> * Looking at PCI vendor IDs, we noticed that the PCI vendor ID used to
>   recognize all Intel controllers in drivers/thunderbolt/quirks.c was
>   0x8087, whereas the Barlow Ridge controller in our device reported a
>   vendor ID of 0x8086. On the off chance that this was a typo of epic
>   proportions, we tried adjusting all of the occurrences of 0x8087 in
>   the tb_quirks[] array to PCI_VENDOR_ID_INTEL (which is defined as
>   0x8086 in include/linux/pci_ids.h). This has no visible effect on the
>   kernel's behavior, and did not resolve the issue. (Presumably there's
>   something going on with the IDs there that we're not aware of.)
>
> As to my speculation as to what's wrong, I believe this is likely a
> combination of two things:
>
> * Some data in the `tb_quirks` array in drivers/thunderbolt/quirks.c is
>   incorrect and leading to the Barlow Ridge controllers not being
>   recognized as needing the DisplayPort redrive mode quirk.
> * The code in drivers/thunderbolt/tb.c `tb_dp_resource_unavailable`
>   that controls whether or not to run `tb_enter_redrive` is faulty in
>   some way and is not calling `tb_enter_redrive` in all scenarios where
>   it is necessary. To be clear, the exact code I'm talking about is
>   this chunk from the aforementioned function:
>
>     tunnel = tb_find_tunnel(tb, TB_TUNNEL_DP, in, out);
>     if (tunnel)
>         tb_deactivate_and_free_tunnel(tunnel);
>     else
>         tb_enter_redrive(port);
>
> Finally, this is probably a result of me misreading the driver code
> somehow, but I was surprised by the following conditional at the top
> of `tb_enter_redrive`:
>
>     if (!(sw->quirks & QUIRK_KEEP_POWER_IN_DP_REDRIVE))
>         return;
>
> To me this reads as "if the DP redrive quirk bit is set, return and do
> nothing. Otherwise, if the bit is not set, run the quirk function."

Note the "return;": the condition is negated, so if the quirk bit is not
set, the function returns early. The rest of the function runs only when
the bit is set.

> This is the opposite of what I would expect - shouldn't the code run if
> the bit is set, not if it is clear? Or does the bit being unset mean
> that the quirk is active? (I do not believe that this is the root cause
> of the issue because even when I forced this function to run any time
> it was invoked, it wasn't being invoked at all.)

Okay, thanks for the very detailed report. We need a bit more
information to investigate this.

The commit you referred to is exactly for this purpose, and I'm
surprised it did not work. The Barlow Ridge PCI IDs are surprising too,
as if this machine had some old firmware or something.

Can you share full dmesg with the repro and "thunderbolt.dyndbg=+p" in
the kernel command line?
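
To illustrate the early-return check discussed above, here is a minimal standalone sketch. It is not the driver code: the struct, the flag value, the `rpm_blocked` field and the function name are all made up for illustration, and only the shape of the `if (!(quirks & FLAG)) return;` test matches the quoted snippet.

```c
#include <stdbool.h>

/* Hypothetical flag value, chosen only for this sketch. */
#define QUIRK_KEEP_POWER_IN_DP_REDRIVE (1u << 0)

/* Hypothetical stand-in for the driver's switch structure. */
struct sketch_switch {
	unsigned int quirks;	/* bitmask of quirk flags */
	bool rpm_blocked;	/* stands in for "runtime PM is held off" */
};

/* Same shape as the quoted check: bail out unless the quirk bit is set. */
static void sketch_enter_redrive(struct sketch_switch *sw)
{
	if (!(sw->quirks & QUIRK_KEEP_POWER_IN_DP_REDRIVE))
		return;		/* bit clear: return early, do nothing */

	sw->rpm_blocked = true;	/* bit set: the body of the function runs */
}
```

With `quirks == 0` the call leaves `rpm_blocked` false, and with the bit set it becomes true, so the body runs when the quirk is active, not when it is clear.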