Hi, thank you for your additional mail. On Fri, 2024-10-04 at 09:45 +0200, Lukas Wunner wrote: > On Thu, Oct 03, 2024 at 03:46:55PM +0200, Lukas Wunner wrote: > > On Wed, Sep 25, 2024 at 03:38:34PM +0000, Wassenberg, Dennis wrote: > > > [ 2.858063] Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] > > > PREEMPT SMP NOPTI > > > [ 2.858071] CPU: 13 UID: 0 PID: 137 Comm: irq/156-pciehp Not tainted 6.11.0-devel+ #3 > > > [ 2.858090] Hardware name: LENOVO 21LVS1CV00/21LVS1CV00, BIOS N45ET18W (1.08 ) 07/08/2024 > > > [ 2.858097] RIP: 0010:dev_driver_string+0x12/0x40 > > > [ 2.858111] Code: 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 > > > 00 00 48 8b 47 68 48 85 c0 74 08 <48> 8b 00 c3 cc cc cc cc 48 8b 47 60 48 85 c0 75 ef 48 8b 97 a8 02 > > > [ 2.858123] RSP: 0000:ffff9493009cfa00 EFLAGS: 00010202 > > > [ 2.858132] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8e53029cb918 RCX: 0000000000000000 > > > [ 2.858139] RDX: ffffffffa586b18a RSI: ffff8e53029cb918 RDI: ffff8e53029cb918 > > > [ 2.858144] RBP: ffff9493009cfb10 R08: 0000000000000000 R09: ffff8e5304f61000 > > > [ 2.858150] R10: ffff9493009cfb20 R11: 0000000000005627 R12: ffffffffa64db188 > > > [ 2.858156] R13: 6b6b6b6b6b6b6b6b R14: 0000000000000080 R15: ffff8e5302b1c0c0 > > > [ 2.858161] FS: 0000000000000000(0000) GS:ffff8e5a50140000(0000) knlGS:0000000000000000 > > > [ 2.858169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 2.858175] CR2: 0000000000000000 CR3: 000000030162e001 CR4: 0000000000f70ef0 > > > [ 2.858182] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > [ 2.858187] DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 > > > [ 2.858193] PKRU: 55555554 > > > [ 2.858196] Call Trace: > > > > [...] > > > [ 2.858258] __dynamic_dev_dbg+0x170/0x210 > > > [ 2.858287] pci_destroy_slot+0x59/0x60 > > > [ 2.858296] pciehp_remove+0x2e/0x50 > > > [ 2.858304] pcie_port_remove_service+0x30/0x50 > > > [ 2.858311] device_release_driver_internal+0x19f/0x200 > > > [ 2.858322] bus_remove_device+0xc6/0x130 > > > [ 2.858335] device_del+0x165/0x3f0 > > > [ 2.858348] device_unregister+0x17/0x60 > > > [ 2.858355] remove_iter+0x1f/0x30 > > > [ 2.858361] device_for_each_child+0x6a/0xb0 > > > [ 2.858368] pcie_portdrv_remove+0x2f/0x60 > > > [ 2.858374] pci_device_remove+0x3f/0xa0 > > > [ 2.858383] device_release_driver_internal+0x19f/0x200 > > > [ 2.858392] bus_remove_device+0xc6/0x130 > > > [ 2.858398] device_del+0x165/0x3f0 > > > [ 2.858413] pci_remove_bus_device+0x91/0x140 > > > [ 2.858422] pci_remove_bus_device+0x3e/0x140 > > > [ 2.858430] pciehp_unconfigure_device+0x98/0x160 > > > [ 2.858439] pciehp_disable_slot+0x69/0x130 > > > [ 2.858447] pciehp_handle_presence_or_link_change+0x281/0x4c0 > > > [ 2.858456] pciehp_ist+0x14a/0x150 > > > > Could you try the patch below and report back if it fixes the issue? > > So the patch I sent you yesterday was based on the insight that > in the stack trace above, pci_destroy_slot() accesses slot->bus > after bus has been destroyed. Yes, the results are looking very good like I mentioned in the mail before. > > The slot is not a child of bus, so doesn't implicitly hold a reference > on bus. The only kobjects holding references on bus are the child > devices of the bus, and they're destroyed before the bus gets destroyed. > The logical conclusion was that slot needs to hold a reference on bus > to avoid a use-after-free. Ok, yes. This explains why this happens more often by adding/remove a pci bus hierarchy in comparison to a single device. > > However looking at the stacktrace with a fresh pair of eyeballs, > I notice there's an oddity here: The kernel normally unbinds drivers > from PCI devices in pci_stop_dev(). The actual removal of the devices > happens in a separate step, pci_remove_bus_device(). > > I note that in your stacktrace, driver unbinding happens from > pci_remove_bus_device(). That's unexpected. The driver should > have already been unbound earlier in pci_stop_dev(). > > Perhaps the pcieport driver was rebound to the device after it was > already unbound. The patch below should prevent that. > > There are no messages though that would prove rebinding of a hotplug port. > > The unplug event happens at the top of the hierarchy (below the Root Port). > So pci_bus_add_devices() binds the Root Port, its driver starts stopping > and removing the hierarchy below, all the while pci_bus_add_devices() > continues binding drivers to the child devices. > > Could you try this patch (in addition to the one below and to the one > I sent yesterday): > > https://lore.kernel.org/all/20241003084342.27501-1-brgl@xxxxxxxx/ > > It should prevent pci_bus_add_devices() from racing with pciehp stopping > and removing devices. I checked the combination of all 3 patches as well. In the end it behaves the same like if I apply the first patch only (the one you sent the day before). All 3 sounds plausible to me but in my testing environment it is sufficient to apply the first one only in order to get this UAF fixed. For me this is something which should be integrated into the stock kernel to make users of Lenovo Thunderbolt 4 docks in combination with a meteor lake based device happy ;) > > BTW, are you using a stock kernel or do you have any custom patches > (such as grsecurity) applied on top which might lead to a different > behavior than a stock kernel? Initially I had custom patches applied (e.g. grsecurity) and we ran into this issue. Before reporting it here I double checked this using a stock kernel. There, no additional patches where applied (just need to add slab_debug in the command line in order to make this UAF visible (like grsecurity did it on our side). In my tests I did the first iteration using additional patches as well (our default system/behavior). This is because I have KASAN enabled meanwhile and the issue is/was still 100% reproducible. Using a stock kernel it failed in round about 50% of the tries (KASAN disabled, only set 'slab_debug'). After enabling KASAN it was not reproducible any more. Thus, I took the system where the reproducibility was even better. After this I double checked the results using a stock kernel. This one behaves exactly the same like the modified kernel we are using (after applying the patches). Thank you & best regards, Dennis > > Thanks, > > Lukas > > -- >8 -- > > diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c > index 387db92..1b38d79 100644 > --- a/drivers/pci/remove.c > +++ b/drivers/pci/remove.c > @@ -36,6 +36,7 @@ static void pci_stop_dev(struct pci_dev *dev) > if (pci_dev_is_added(dev)) { > device_for_each_child(dev->dev.parent, dev_of_node(&dev->dev), > pci_pwrctl_unregister); > + dev->match_driver = false; > device_release_driver(&dev->dev); > pci_proc_detach_device(dev); > pci_remove_sysfs_dev_files(dev);