On Mon, Oct 07, 2024 at 04:49:19PM +0000, Wassenberg, Dennis wrote:
> > The unplug event happens at the top of the hierarchy (below the Root Port).
> > So pci_bus_add_devices() binds the Root Port, its driver starts stopping
> > and removing the hierarchy below, all the while pci_bus_add_devices()
> > continues binding drivers to the child devices.
> >
> > Could you try this patch (in addition to the one below and to the one
> > I sent yesterday):
> >
> > https://lore.kernel.org/all/20241003084342.27501-1-brgl@xxxxxxxx/
> >
> > It should prevent pci_bus_add_devices() from racing with pciehp stopping
> > and removing devices.
>
> I checked the combination of all 3 patches as well. In the end it behaves
> the same like if I apply the first patch only (the one you sent the day
> before).

Thanks a lot for testing and the detailed feedback.

Would it be possible for you to try the above-linked patch alone
(on top of a recent stock kernel), i.e. without the refcounting fix
that you say was sufficient to avoid the UAF?

And I'd also appreciate if you could try the match_driver approach ...

https://lore.kernel.org/all/Zv-dIHDXNNYomG2Y@xxxxxxxxx/

... alone, i.e. without any other patches.

It's interesting that the refcounting fix was sufficient to avoid
the UAF but I can't get over the fact that the pcieport driver is
unbound from pci_remove_bus_device(), when it should no longer be
bound in the first place.

My impression is that teardown of the hierarchy by pciehp races with
driver binding after the initial root bus scan, so we probably should
try to avoid that.  I'd like to confirm (or disprove) that hunch.

The refcounting fix could be applied as a safety net but normally
shouldn't be necessary if driver unbinding happens in pci_stop_dev()
and the device remains unbound afterwards.  The match_driver patch
should achieve that.  And the other patch by Bartosz (linked above)
should achieve the same by serializing driver binding after bus
enumeration with driver unbinding by pciehp.

Finally, I'd appreciate if you could send me dmesg output with the
refcounting fix applied.  As said before, the MTL Thunderbolt
controller claims that the link and slot presence bits are cleared,
so it de-enumerates everything attached via Thunderbolt.  I'm wondering
if it then re-enumerates the Thunderbolt-attached devices so they're
actually usable?

I'm hoping Mika can clarify with Intel Thunderbolt CoE whether this
is a hardware issue in MTL that can e.g. be fixed through a firmware
or BIOS update.

Thanks!

Lukas