On Fri, 2024-07-26 at 09:19 +0900, Damien Le Moal wrote: > On 7/25/24 21:07, Philipp Stanner wrote: > > pci_intx() is a function that becomes managed if > > pcim_enable_device() > > has been called in advance. Commit 25216afc9db5 ("PCI: Add managed > > pcim_intx()") changed this behavior so that pci_intx() always leads > > to > > creation of a separate device resource for itself, whereas earlier, > > a > > shared resource was used for all PCI devres operations. > > > > Unfortunately, pci_intx() seems to be used in some drivers' > > remove() > > paths; in the managed case this causes a device resource to be > > created > > on driver detach. > > > > Fix the regression by only redirecting pci_intx() to its managed > > twin > > pcim_intx() if the pci_command changes. > > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()") > > Reported-by: Damien Le Moal <dlemoal@xxxxxxxxxx> > > Closes: > > https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@xxxxxxxxxx/ > > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx> > > --- > > Alright, I reproduced this with QEMU as Damien described and this > > here > > fixes the issue on my side. Feedback welcome. Thank you very much, > > Damien. > > This works ans is cleaner that what I had :) The fundamental idea is mostly identical to yours – you likely just didn't see it because your attention was directed towards the code in devres.c ;) > > Tested-by: Damien Le Moal <dlemoal@xxxxxxxxxx> You might wanna ping Bjorn about that in case he didn't see. > > > It seems that this might yet again be the issue of drivers not > > being > > aware that pci_intx() might become managed, so they use it in their > > unwind path (rightfully so; there probably was no alternative back > > then). > > At least for the ahci driver with wich I found the issue, what is odd > is that > there is only a single call to pci_intx() There is only a single _directly visible_ call :] > to *enable* intx, and that call is in > the probe path. With QEMU, this call is not even done as the qemu > AHCI support MSI. Hmm, MSI... > > Adding a WARN_ON(!enable) at the beginning of pci_inx(), we can see > that what > happens is that during device probe, we get this backtrace: > > [ 34.658988] WARNING: CPU: 0 PID: 999 at drivers/pci/pci.c:4480 > pci_intx+0x7f/0xc0 > [ 34.660799] Modules linked in: ahci(+) rpcsec_gss_krb5 auth_rpcgss > nfsv4 > dns_resolver nfs lockd grace netfs] > [ 34.673784] CPU: 0 UID: 0 PID: 999 Comm: modprobe Tainted: > G W > 6.10.0+ #1948 > [ 34.675961] Tainted: [W]=WARN > [ 34.676891] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), > BIOS > 1.16.3-2.fc40 04/01/2014 > [ 34.679197] RIP: 0010:pci_intx+0x7f/0xc0 > [ 34.680348] Code: b7 d2 be 04 00 00 00 48 89 df e8 0c 84 ff ff 48 > 8b 44 24 08 > 65 48 2b 04 25 28 00 00 00 756 > [ 34.685015] RSP: 0018:ffffb60f40e2f7f0 EFLAGS: 00010246 > [ 34.686436] RAX: 0000000000000000 RBX: ffff9dbb81237000 RCX: > ffffb60f40e2f64c > [ 34.688294] RDX: 0000000000000000 RSI: 0000000000000000 RDI: > ffff9dbb81237000 > [ 34.690120] RBP: 0000000000000000 R08: 0000000000000000 R09: > 0000000000000001 > [ 34.691986] R10: ffff9dbb88883538 R11: 0000000000000001 R12: > 0000000000000001 > [ 34.693687] R13: ffff9dbb812370c8 R14: ffff9dbb86eeaa00 R15: > 0000000000000000 > [ 34.695140] FS: 00007f9d81465740(0000) GS:ffff9dbcf7c00000(0000) > knlGS:0000000000000000 > [ 34.696884] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 34.698107] CR2: 00007ffc786ed8b8 CR3: 00000001088da000 CR4: > 0000000000350ef0 > [ 34.699562] Call Trace: > [ 34.700215] <TASK> > [ 34.700802] ? pci_intx+0x7f/0xc0 > [ 34.701607] ? __warn.cold+0xa5/0x13c > [ 34.702448] ? pci_intx+0x7f/0xc0 > [ 34.703257] ? report_bug+0xff/0x140 > [ 34.704105] ? handle_bug+0x3a/0x70 > [ 34.704938] ? exc_invalid_op+0x17/0x70 > [ 34.705826] ? asm_exc_invalid_op+0x1a/0x20 > [ 34.706593] ? pci_intx+0x7f/0xc0 > [ 34.707086] msi_capability_init+0x35a/0x370 > [ 34.707723] __pci_enable_msi_range+0x187/0x240 > [ 34.708356] pci_alloc_irq_vectors_affinity+0xc4/0x110 > [ 34.709058] ahci_init_one+0x6ec/0xcc0 [ahci] > [ 34.709692] ? __pm_runtime_resume+0x58/0x90 > [ 34.710311] local_pci_probe+0x45/0x90 > [ 34.710865] pci_device_probe+0xbb/0x230 > [ 34.711433] really_probe+0xcc/0x350 > [ 34.711976] ? pm_runtime_barrier+0x54/0x90 > [ 34.712569] ? __pfx___driver_attach+0x10/0x10 > [ 34.713206] __driver_probe_device+0x78/0x110 > [ 34.713837] driver_probe_device+0x1f/0xa0 > [ 34.714427] __driver_attach+0xbe/0x1d0 > [ 34.715001] bus_for_each_dev+0x92/0xe0 > [ 34.715563] bus_add_driver+0x115/0x200 > [ 34.716136] driver_register+0x72/0xd0 > [ 34.716704] ? __pfx_ahci_pci_driver_init+0x10/0x10 [ahci] > [ 34.717457] do_one_initcall+0x76/0x3a0 > [ 34.718036] do_init_module+0x60/0x210 > [ 34.718616] init_module_from_file+0x86/0xc0 > [ 34.719243] idempotent_init_module+0x127/0x2c0 > [ 34.719913] __x64_sys_finit_module+0x5e/0xb0 > [ 34.720546] do_syscall_64+0x7d/0x160 > [ 34.721100] ? srso_return_thunk+0x5/0x5f > [ 34.721695] ? do_syscall_64+0x89/0x160 > [ 34.722258] ? srso_return_thunk+0x5/0x5f > [ 34.722846] ? do_sys_openat2+0x9c/0xe0 > [ 34.723421] ? srso_return_thunk+0x5/0x5f > [ 34.724012] ? syscall_exit_to_user_mode+0x64/0x1f0 > [ 34.724703] ? srso_return_thunk+0x5/0x5f > [ 34.725293] ? do_syscall_64+0x89/0x160 > [ 34.725883] ? srso_return_thunk+0x5/0x5f > [ 34.726467] ? syscall_exit_to_user_mode+0x64/0x1f0 > [ 34.727159] ? srso_return_thunk+0x5/0x5f > [ 34.727764] ? do_syscall_64+0x89/0x160 > [ 34.728341] ? srso_return_thunk+0x5/0x5f > [ 34.728937] ? exc_page_fault+0x6c/0x200 > [ 34.729511] ? srso_return_thunk+0x5/0x5f > [ 34.730109] entry_SYSCALL_64_after_hwframe+0x76/0x7e > [ 34.730837] RIP: 0033:0x7f9d80d281dd > [ 34.731375] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e > fa 48 89 f8 > 48 89 f7 48 89 d6 48 89 ca 4d8 > [ 34.733796] RSP: 002b:00007ffc786f0898 EFLAGS: 00000246 ORIG_RAX: > 0000000000000139 > [ 34.734894] RAX: ffffffffffffffda RBX: 00005617347f09a0 RCX: > 00007f9d80d281dd > [ 34.735850] RDX: 0000000000000000 RSI: 0000561715fe5e79 RDI: > 0000000000000003 > [ 34.736812] RBP: 00007ffc786f0950 R08: 00007f9d80df6b20 R09: > 00007ffc786f08e0 > [ 34.737769] R10: 00005617347f13e0 R11: 0000000000000246 R12: > 0000561715fe5e79 > [ 34.738725] R13: 0000000000040000 R14: 00005617347f8990 R15: > 00005617347f8b20 > [ 34.739695] </TASK> > [ 34.740075] ---[ end trace 0000000000000000 ]--- > > So it is msi_capability_init() that is the problem: that function > calls > pci_intx_for_msi(dev, 0) which then calls pci_intx(dev, 0), thus > creating the > intx devres for the device despite the driver code not touching intx > at all. Nope, that is not the problem – as you correctly point out below, that device resource should be destroyed again. > The driver is fine ! It is MSI touching INTX that is messing things > up. Yes, many drivers are more or less innocent in regards with all of that. As Christoph rightfully pointed out, an API should never behave like that and do _implicit_ magic behind your back. That's the entire philosophy of the C Language. > > That said, I do not see that as an issue in itself. What I fail to > understand > though is why that intx devres is not deleted on device teardown. I > think this > may have something to do with the fact that pcim_intx() always does > "res->orig_intx = !enable;", that is, it assumes that if there is a > call to > pcim_intx(dev, 0), then it is because intx where enabled already, > which I do not > think is true for most drivers... So we endup with INTX being wrongly > enabled on > device teardown by pcim_intx_restore(), and because of that, the intx > resource > is not deleted ? Spoiler: The device resources that have initially been created do get deleted. Devres works as intended. The issue is that the forces of evil invoke pci_intx() through another path, hidden behind an API, through another devres callback. So the device resource never gets deleated because it is *created* on driver detach, when devres already ran. > > Re-enabling intx on teardown is wrong I think, but that still does > not explain > why that resource is not deleted. I fail to see why. You came very close to the truth ;) With some help from my favorite coworker I did some tracing today and found this when doing `rmmod ahci`: => pci_intx => pci_msi_shutdown => pci_disable_msi => devres_release_all => device_unbind_cleanup => device_release_driver_internal => driver_detach => bus_remove_driver => pci_unregister_driver => __do_sys_delete_module => do_syscall_64 => entry_SYSCALL_64_after_hwframe The SYSCALL is my `rmmod`. As you can see, pci_intx() is invoked indirectly through pci_disable_msi() – which gets invoked by devres, which is precisely one reason why you could not find the suspicious pci_intx() call in the ahci code base. Now the question is: Who set up that devres callback which indirectly calls pci_intx()? It is indeed MSI, here in msi/msi.c: static void pcim_msi_release(void *pcidev) { struct pci_dev *dev = pcidev; dev->is_msi_managed = false; pci_free_irq_vectors(dev); // <-- calls pci_disable_msi(), which calls pci_intx(), which re-registers yet another devres callback } /* * Needs to be separate from pcim_release to prevent an ordering problem ==> Oh, here they even had a warning about that interacting with devres somehow... * vs. msi_device_data_release() in the MSI core code. */ static int pcim_setup_msi_release(struct pci_dev *dev) { int ret; if (!pci_is_managed(dev) || dev->is_msi_managed) return 0; ret = devm_add_action(&dev->dev, pcim_msi_release, dev); if (ret) return ret; dev->is_msi_managed = true; return 0; } I don't know enough about AHCI to see where exactly it jumps into these, but a candidate would be: * pci_enable_msi(), called among others in acard-ahci.c Another path is: 1. ahci_init_one() calls 2. ahci_init_msi() calls 3. pci_alloc_irq_vectors() calls 4. pci_alloc_irq_vectors_affinity() calls 5. __pci_enable_msi_range() OR __pci_enable_msix_range() call 6. pci_setup_msi_context() calls 7. pcim_setup_msi_release() which registers the callback to pci_intx() Ha! I think I earned myself a Friday evening beer now 8-) Now the interesting question will be how the heck we're supposed to clean that up. Another interesting question is: Did that only work by coincidence during the last 15 years, or is it by design that the check in pci_intx(): if (new != pci_command) only evaluates to true if we are not in a detach path. If it were coincidence, it would not have caused faults as it did now with my recent work, because the old version did not allocate in pci_intx(). But it could certainly have been racy and might run into a UAF since the old pci_intx() would have worked on memory that is also managed by devres, but has been registered at a different place. I guess that is what that comment in the MSI code quoted above is hinting at. Christoph indeed rightfully called it voodoo ^^ Cheers, P.