On Mon, Jul 16, 2018 at 12:25:07AM +0530, poza@xxxxxxxxxxxxxx wrote: > On 2018-07-13 03:21, Bjorn Helgaas wrote: > > On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote: > > > Hi Bjorn, > > > Thank you very much to review the V1 patch. I reworked the patch as > > > you have suggested. Your suggestion is indeed better and cleaner. > > > Would you please have a look for me? > > > > > > FYI, I did not clear the device structure in aer_isr_one_error() > > > just to avoid over complicated the fix. > > > > I think that was a bad idea anyway. I think the problem there is that > > add_error_device() copies the pci_dev pointer without taking a > > reference on it: > > > > aer_isr > > get_e_source # dequeue from rpc->e_sources[] > > *e_src = rpc->e_sources[x] > > aer_isr_one_error > > pdev = rpc->rpd # pdev = Root Port pci_dev > > find_source_device(pdev) > > find_device_iter > > add_error_device > > e_info->dev[i] = dev # <-- save pci_dev pointer > > aer_process_err_devices > > handle_error_source(e_info->dev[i]) > > pcie_do_fatal_recovery > > > > At this point, we have pci_dev pointers in e_info->dev[] that are > > potentially stale. Right now nobody uses them, but I think it's > > sloppy that we still have them at all. > > > > I think a better way to fix this would be to call pci_dev_get() in > > add_error_device() when we save the pointer, and then do the > > corresponding pci_dev_put() in aer_isr_one_error(), after we return > > from aer_process_err_devices(). > > > > This could be done with a little helper function that iterates through > > e_info->dev[i], calls pci_dev_put() for each, and clears > > e_info->error_dev_num (which could then be removed from > > find_source_device()). > > > > I think this would fix the problem you're seeing, so we wouldn't need > > the change in pcie_do_fatal_recovery(). > > > > However, I think we're also slightly exposed in dpc_work(), in basically > > the same (possibly harmless) way. > > > > dpc_irq > > schedule_work(&dpc->work) > > ... > > dpc_work > > pdev = dpc->dev->port > > pcie_do_fatal_recovery(pdev) > > > > pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still > > holding onto a pointer (which it never uses again). > > > > The DPC driver should be holding a reference to pdev (through some black > > magic I don't understand), but that would be released when pdev is > > removed, > > and I don't know what ensures that dpc_work() runs before that release. > > > > Is not that dpc_work() is calling pcie_do_fatal_recovery() which in-turn is > removing pdev. > I do not see DPC is using any more reference beyond that. > So, isnt it safe to assume that no matter when the dpc_work() run, the pdev > it reference is always intact. > this is with the fact, that both AER and DPC can not trigger simultaneously, > and and hence no race for removal. > who else would remove that pdev other than AER and DPC ? It's not a problem for the devices below the contained port. It's more about the port itself, which may happen if you hot-remove the switch that has DPC services: nothing coordinates that switch's removal with the dpc_work so the dpc work could be referencing a pdev that pciehp removed earlier, for example. Removing switches currently is not very common as far as I know, but there are definitely users that want that ability. > With Keith's patch, are not we just improving the latency with threaded irq > instead of workqueue ? It's a little more than that. The irq subsystem will sync with the threaded irq before allowing teardown of the device to complete, so that should ensure we don't have a use-after-free in the dpc bottom half. > the potential problem is another interrupt arriving while we have > acknowledged it in dpc_reset_link() and we are about to remove the devices. > The acknowledgement of DPC status trigger and re-enabling interrupt can be > done in > dpc_work() or threaded_handler() after we are done with > pcie_do_fatal_recovery() > > dpc_reset_link() > { > following can be moved after pcie_do_fatal_recovery() ?? > pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS, > PCI_EXP_DPC_STATUS_TRIGGER); > > pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl); > pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL, > ctl | PCI_EXP_DPC_CTL_INT_EN); > } > > and probably following also one also can be done after recovery. > > pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS, > PCI_EXP_DPC_STATUS_INTERRUPT); > > I am not sure if this was the concern or I did talk about something else all > together !