On Fri, 2024-12-06 at 20:31 +0100, Niklas Schnelle wrote: > On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote: > > On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote: > > > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove > > > bandwidth notification"). An upcoming commit extends this driver > > > building PCIe bandwidth controller on top of it. > > > > > > The PCIe bandwidth notification were first added in the commit > > > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth > > > notification") but later had to be removed. The significant changes > > > compared with the old bandwidth notification driver include: > > > > ---8<--- > > > --- > > > > Hi Ilpo, > > > > I bisected a v6.13-rc1 boot hang on my personal workstation to this > > patch. Sadly I don't have much details like a panic or so because the > > boot hangs before any kernel messages, or at least they're not visible > > long enough to see. I haven't yet looked into the code as I wanted to > > raise awareness first. Since the commit doesn't revert cleanly on > > v6.13-rc1 I also haven't tried that yet. > > > > Here are some details on my system: > > - AMD Ryzen 9 3900X > > - ASRock X570 Creator Motherboard > > - Radeon RX 5600 XT > > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged) > > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs > > - Intel n I211 Gigabit NIC > > - Intel Wi-Fi 6 AX200 > > - Aquantia AQtion AQC107 NIC > > > > If you have patches or things to try just ask. > > > > Thanks, > > Niklas > > > > Ok I can now at least confirm that bluntly disabling the new bwctrl > driver with the below diff on top of v6.13-rc1 circumvents the boot > hang I'm seeing. So it's definitely this. > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c > index 5e10306b6308..6fa54480444a 100644 > --- a/drivers/pci/pcie/portdrv.c > +++ b/drivers/pci/pcie/portdrv.c > @@ -828,7 +828,7 @@ static void __init pcie_init_services(void) > pcie_aer_init(); > pcie_pme_init(); > pcie_dpc_init(); > - pcie_bwctrl_init(); > + /* pcie_bwctrl_init(); */ > pcie_hp_init(); > } > Also here is the full lspci -vvv output running the above on v6.13-rc1: https://paste.js.org/9UwQIMp7eSgp Also note that I have CONFIG_PCIE_THERMAL unset so it's also not the cooling device portion that's causing the issue. Next I guess I should narrow it down to the specific port where enabling the bandwidth monitoring is causing trouble, not yet sure how best to do this with this many devices. Thanks, Niklas