On Tue, Jun 05, 2018 at 04:18:26PM -0600, Keith Busch wrote: > On Tue, Jun 05, 2018 at 05:09:11PM -0500, Bjorn Helgaas wrote: > > > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work) > > > struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); > > > struct aer_err_source uninitialized_var(e_src); > > > > > > - mutex_lock(&rpc->rpc_mutex); > > > + pci_lock_rescan_remove(); > > > while (get_e_source(rpc, &e_src)) > > > aer_isr_one_error(rpc, &e_src); > > > - mutex_unlock(&rpc->rpc_mutex); > > > + pci_unlock_rescan_remove(); > > > > I think this needs to be updated after Oza's patches, doesn't it? > > > > It looks like this would deadlock if I applied it to my current "next" > > branch as-is: > > > > aer_isr > > pci_lock_rescan_remove > > aer_isr_one_error > > aer_process_err_devices > > handle_error_source > > pcie_do_fatal_recovery > > pci_lock_rescan_remove <-- deadlock > > > > > aer_release(rpc); > > > } > > Yes, looks like you are right about that. > > I fully intended to have this rebased on that by now, but nvme issues > took way more time than I anticipated. Things appear to have calmed down > on that front, and I should be able to rebase appropriately this week > (famous last words...). No worries, it's doubtful that I can still squeeze it into v4.18, so I think it's better if we target this for v4.19. I have some questions unrelated to the rebase: - What does the use-after-free look like to a user? Panic, corruption, etc? It's nice if we have breadcrumbs that connect a symptom to the fix. - I'm not super comfortable with the AER tree traversal in find_source_device(). Obviously this has always been there and is not your issue. But it's dubious when a driver for device X also peeks at devices Y, Z, etc. That always leads to locking issues. - I'm not clear on how pci_bus_sem and pci_rescan_remove_lock should be used. pci_bus_sem is acquired by pci_walk_bus() and a few other PCI core paths. pci_rescan_remove_lock is acquired (via pci_lock_rescan_remove()) by hotplug drivers and a few other add/remove/rescan paths. Most users of pci_walk_bus() do not use pci_lock_rescan_remove(), so it's not clear to me how to decide whether they *all* should, or why this AER path is different from the ones that don't need to. Bjorn