We're experiencing a Linux kernel bug affecting multiple Clevo X370SNx1 laptops (specifically the X370SNW1 variant). The bug appears to be present in kernels 6.5 and later, and worsens significantly with kernel 6.11.2 (the latest stable release at the time of writing). It is unclear whether all of the issues we encountered are the same bug; however, the primary problem appears to be a consequence of the power management code involving Intel Barlow Ridge controllers and DisplayPort. The issue occurs with the in-kernel Nouveau driver and also with the proprietary NVIDIA drivers.

When a DisplayPort monitor is attached to these laptops via a USB-C connection, the monitor is recognized by the system and comes on for approximately 15 seconds. It then blanks out and is automatically disconnected from the system as if it had been unplugged. It remains that way indefinitely until it is unplugged and replugged, or until something "jiggles" (for lack of a better term) the thunderbolt driver. When either of these things occurs, the display re-attaches and comes back on for 15 seconds, then blanks out and detaches again.

There are various things that can "jiggle" the thunderbolt driver, including but not limited to:

* Running `lspci -k` (this one came as a particular surprise)
* Removing and re-inserting the thunderbolt driver (`sudo modprobe -r thunderbolt; sleep 1; sudo modprobe thunderbolt`)
* Running `nvidia-detector` while proprietary NVIDIA drivers are loaded

It is possible to mitigate this issue by simply running `sudo modprobe -r thunderbolt` or `sudo rmmod thunderbolt` and then leaving the driver unloaded. USB-C displays become stable after this: they are recognized when attached and remain recognized and functional indefinitely, as one would expect.

We believe this is related to the Intel Barlow Ridge USB4 controller because:

* Removing the thunderbolt driver restores normal display operation.
* This issue was *not* a problem on Clevo X370SNx machines, which are identical to the X370SNx1 except that the Maple Ridge TBT controller on the board has been replaced with a Barlow Ridge USB4 controller.
* This problem does not occur on the affected models with the 6.1 kernel. It occurs with the 6.5 kernel and on all newer kernels we have tried.

Furthermore, from inspecting the Thunderbolt driver code, we believe this is related to the power management features of the driver, because:

* There is only one 15-second timeout defined in the driver source code, that being TB_AUTOSUSPEND_DELAY in drivers/thunderbolt/tb.h
* On earlier kernels (Ubuntu’s variant of 6.8 at least), displays are stable even when the thunderbolt driver is loaded if we:
  * Remove the thunderbolt driver
  * Attach a USB-C dock
  * Attach displays to the dock (we used two 4K HDMI monitors)
  * Reload the thunderbolt driver

During our investigation, we discovered commit a75e0684efe567ae5f6a8e91a8360c4c1773cf3a (patch on mailing list at https://lore.kernel.org/linux-usb/20240213114318.3023150-1-mika.westerberg@xxxxxxxxxxxxxxx/), which appears to be a fix for this exact problem. It adds a quirk for Intel Barlow Ridge controllers that detects when a DisplayPort device has been plugged directly into the USB4 port (thus using "redrive" mode) and, if so, instructs the power management subsystem not to power the chip down during that time.

Unfortunately, this quirk seems to be silently ignored. We built a custom kernel with `printk` lines added to the `tb_enter_redrive` and `tb_exit_redrive` functions to announce when they were called, and nothing in the dmesg log indicated that either function had been called.

This bug is easily reproducible using the stock kernels in Kubuntu 22.04, Kubuntu 24.04, Kali Linux 2024.2, and Fedora Workstation Rawhide. Similar behavior is observed across all of these distributions.

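Regarding the 15-second TB_AUTOSUSPEND_DELAY observation above, one way to cross-check the autosuspend theory from user space might be to watch the generic runtime PM attributes (`control`, `runtime_status`, `autosuspend_delay_ms`) that the kernel exposes in a device's sysfs `power/` directory. The following is only a minimal sketch under assumptions: the helper names `read_attr` and `print_rpm_state` are hypothetical, the directory to inspect is supplied by the caller (no particular device path is assumed here), and not every device exposes all three attributes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of the file <dir>/<name> into buf and strip the
 * trailing newline. Returns 0 on success, -1 on any failure.
 * (Hypothetical helper, not part of any kernel or library API.)
 */
int read_attr(const char *dir, const char *name, char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(dir != NULL && name != NULL && buf != NULL);
	if (len < 2)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(buf, (int)len, f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Print the runtime PM attributes found in a device's power/ directory.
 * Attributes that are missing or unreadable are reported as unavailable.
 */
void print_rpm_state(const char *power_dir)
{
	static const char *const attrs[] = {
		"control", "runtime_status", "autosuspend_delay_ms",
	};
	char value[64];
	size_t i;

	for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
		if (read_attr(power_dir, attrs[i], value, sizeof(value)) == 0)
			printf("%s: %s\n", attrs[i], value);
		else
			printf("%s: <unavailable>\n", attrs[i]);
	}
}
```

Calling `print_rpm_state()` from a small `main()` once per second, with the `power/` directory of the device of interest, would show whether `runtime_status` changes at about the moment the display detaches. It would not by itself prove which driver initiated the suspend.
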
We built the 6.11.2 kernel from source and tested it on Kubuntu 24.04. The kernel built, installed, and functioned properly in most respects, but it actually made the problem with USB-C displays worse. As long as the thunderbolt driver was loaded, no displays were detected when plugged in (not even for a short length of time). When the thunderbolt driver was unloaded, displays were recognized and functional only if a single display was attached. Attaching a second display resulted in the first external display becoming detached and the second display not coming on. Unplugging the second display resulted in the first display reattaching. This machine supports up to three external displays, and this has proven to be achievable and stable with earlier kernels. No useful error messages were logged in dmesg when these problems occurred.

Our testing has been limited to the Clevo X370SNW1 model; however, we expect that the X370SNV1 model will exhibit the same issues, as it uses very similar internal components on the system board.

This is basically the extent of our knowledge at this point. We attempted various patches on Ubuntu's 6.8 kernel to resolve the issue, all without success:

* We attempted reverting fd4d58d1fef9ae9b0ee235eaad73d2e0a6a73025 (thunderbolt: Enable CL2 low power state), which had no effect.
* We noticed that one of the Barlow Ridge bridge controllers listed by `lspci -k` appeared not to have its device ID in drivers/thunderbolt/nhi.h, and that there was a corresponding quirk in drivers/thunderbolt/quirks.c that looked like it might be vaguely related to the issue (specifically quirk_usb3_maximum_bandwidth). We tried adding that device to the appropriate files so that the quirk would apply to it as well. This had no visible effect on the kernel's operation and did not resolve the issue.
* After narrowing it down to `quirk_block_rpm_in_redrive`, we attempted adding a new `thunderbolt.kf_force_redrive` kernel parameter in drivers/thunderbolt/tb.c that forced the code in `tb_enter_redrive` and `tb_exit_redrive` to be executed even *if* the device didn't have the appropriate quirk bit set, in the hope that this would make the quirk execute and resolve the issue. What ended up happening was that `tb_enter_redrive` was somehow never called at all, while `tb_exit_redrive` was called. As a result, no USB-C displays were recognized at all, not even briefly, while the thunderbolt driver was loaded.
* Looking at PCI vendor IDs, we noticed that the vendor ID used to recognize all Intel controllers in drivers/thunderbolt/quirks.c was 0x8087, whereas the Barlow Ridge controller in our device reported a PCI vendor ID of 0x8086. On the off chance that this was a typo of epic proportions, we tried changing all of the occurrences of 0x8087 in the tb_quirks[] array to PCI_VENDOR_ID_INTEL (which is defined as 0x8086 in include/linux/pci_ids.h). This had no visible effect on the kernel's behavior and did not resolve the issue. (Presumably there's something going on with the IDs there that we're not aware of.)

As for my speculation about what's wrong, I believe this is likely a combination of two things:

* Some data in the `tb_quirks` array in drivers/thunderbolt/quirks.c is incorrect, leading to the Barlow Ridge controllers not being recognized as needing the DisplayPort redrive mode quirk.
* The code in `tb_dp_resource_unavailable` in drivers/thunderbolt/tb.c that controls whether or not to run `tb_enter_redrive` is faulty in some way and is not calling `tb_enter_redrive` in all scenarios where it is necessary.

To be clear, the exact code I'm talking about is this chunk from the aforementioned function:

    tunnel = tb_find_tunnel(tb, TB_TUNNEL_DP, in, out);
    if (tunnel)
        tb_deactivate_and_free_tunnel(tunnel);
    else
        tb_enter_redrive(port);

Finally, this is probably a result of me misreading the driver code somehow, but I was surprised by the following conditional at the top of `tb_enter_redrive`:

    if (!(sw->quirks & QUIRK_KEEP_POWER_IN_DP_REDRIVE))
        return;

To me this reads as "if the DP redrive quirk bit is set, return and do nothing. Otherwise, if the bit is not set, run the quirk function." This is the opposite of what I would expect. Shouldn't the code run if the bit is set, not if it is clear? Or does the bit being unset mean that the quirk is active? (I do not believe that this is the root cause of the issue, because even when I forced this function to run any time it was invoked, it wasn't being invoked at all.)

This issue has only been definitively reproduced on already-EOL kernels, due to the (potentially related) problem encountered with 6.11.2. However, based on a code comparison, it appears that all of the apparently relevant code (that which deals with the DP quirk) is identical between Ubuntu's variation of the 6.8 kernel and the tip of the mainline master branch. Therefore I believe this issue very likely impacts the latest mainline kernel.
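
On the question about the conditional at the top of `tb_enter_redrive`: the bit test can be checked in isolation with a small user-space sketch. Everything below is a hypothetical stand-in (the `demo_switch` struct, the `DEMO_QUIRK_KEEP_POWER_IN_DP_REDRIVE` value, and the function name are not the kernel's definitions); only the shape of the `if (!(quirks & BIT))` guard is taken from the snippet quoted above. In standard C, `!(x & BIT)` is true when the bit is clear, so a guard of this shape returns early when the quirk is *not* set and falls through when it is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary bit chosen for this demo; not the kernel's actual value. */
#define DEMO_QUIRK_KEEP_POWER_IN_DP_REDRIVE (1UL << 0)

/* Hypothetical stand-in for the one field of the switch that matters here. */
struct demo_switch {
	unsigned long quirks;
};

/*
 * Returns true if execution would continue past a guard of the same shape
 * as the one quoted from tb_enter_redrive, false if the guard returns early.
 */
bool demo_enter_redrive_runs(const struct demo_switch *sw)
{
	assert(sw != NULL);

	/* Early return when the quirk bit is clear. */
	if (!(sw->quirks & DEMO_QUIRK_KEEP_POWER_IN_DP_REDRIVE))
		return false;

	/* Reached only when the quirk bit is set. */
	return true;
}
```

With `quirks` set to 0 the function reports that the body is skipped, and with the demo bit set it reports that the body runs. If the real guard behaves the same way, the conditional itself would be consistent with "run only when the quirk is set", which fits the observation that forcing the function body to run did not help because the function was never invoked in the first place.
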