On Thu, Sep 06, 2018 at 06:26:34PM +0200, Lukas Wunner wrote:
> I asked for it to be dropped because while it fixes the problem,
> it's not a good solution, it would just add technical debt to the
> code base and come back to haunt us later.
>
> Xiongfeng Wang found a similar race a few months ago that wouldn't
> be fixed by my patch:
> https://patchwork.ozlabs.org/patch/877835/
>
> The PCI core nicely separates unbinding of drivers from destruction
> of the pci_dev (pci_stop_bus_device() + pci_remove_bus_device()).
> Problem is we currently protect both steps with pci_lock_rescan_remove().
> We should only be protecting the second step. The first step (unbinding)
> needs to run lockless.
>
> As a start, I've moved the call to pcie_aspm_exit_link_state() out
> of pci_stop_dev(). This also fixes an ASPM bug. Bjorn merged it
> the day before yesterday:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/aspm&id=1f3934b1d5e5
>
> The call to device_release_driver() already uses locking internally
> and can be run lockless. The calls to pci_proc_detach_device() and
> pci_remove_sysfs_dev_files() need to be amended with locking, I've
> got some preliminary patches for this on my development branch that
> I'll have to rework. This is very hairy, historically grown code
> that requires great care to avoid breakage. It'll take a little
> more time I'm afraid.

OK, thanks for the explanation. I see you have a good plan going forward
with this.

I can reproduce the issue easily so can assist in testing if needed.