Re: [PATCH V2, 0/1] PCI/AER: fix use-after-free in pcie_do_fatal_recovery

Keith Busch <keith.busch@xxxxxxxxx> · Mon, 16 Jul 2018 08:00:37 -0600

On Mon, Jul 16, 2018 at 12:25:07AM +0530, poza@xxxxxxxxxxxxxx wrote:
> On 2018-07-13 03:21, Bjorn Helgaas wrote:
> > On Thu, Jul 12, 2018 at 11:33:42AM -0600, Thomas Tai wrote:
> > > Hi Bjorn,
> > > Thank you very much to review the V1 patch. I reworked the patch as
> > > you have suggested. Your suggestion is indeed better and cleaner.
> > > Would you please have a look for me?
> > > 
> > > FYI, I did not clear the device structure in aer_isr_one_error()
> > > just to avoid over complicated the fix.
> > 
> > I think that was a bad idea anyway.  I think the problem there is that
> > add_error_device() copies the pci_dev pointer without taking a
> > reference on it:
> > 
> >   aer_isr
> >     get_e_source              # dequeue from rpc->e_sources[]
> >       *e_src = rpc->e_sources[x]
> >     aer_isr_one_error
> >       pdev = rpc->rpd         # pdev = Root Port pci_dev
> >       find_source_device(pdev)
> > 	find_device_iter
> > 	  add_error_device
> > 	    e_info->dev[i] = dev   # <-- save pci_dev pointer
> >       aer_process_err_devices
> > 	handle_error_source(e_info->dev[i])
> > 	  pcie_do_fatal_recovery
> > 
> > At this point, we have pci_dev pointers in e_info->dev[] that are
> > potentially stale.  Right now nobody uses them, but I think it's
> > sloppy that we still have them at all.
> > 
> > I think a better way to fix this would be to call pci_dev_get() in
> > add_error_device() when we save the pointer, and then do the
> > corresponding pci_dev_put() in aer_isr_one_error(), after we return
> > from aer_process_err_devices().
> > 
> > This could be done with a little helper function that iterates through
> > e_info->dev[i], calls pci_dev_put() for each, and clears
> > e_info->error_dev_num (which could then be removed from
> > find_source_device()).
> > 
> > I think this would fix the problem you're seeing, so we wouldn't need
> > the change in pcie_do_fatal_recovery().
> > 
> > However, I think we're also slightly exposed in dpc_work(), in basically
> > the same (possibly harmless) way.
> > 
> >   dpc_irq
> >     schedule_work(&dpc->work)
> >   ...
> >   dpc_work
> >     pdev = dpc->dev->port
> >     pcie_do_fatal_recovery(pdev)
> > 
> > pdev may be removed by pcie_do_fatal_recovery(), but dpc_work() is still
> > holding onto a pointer (which it never uses again).
> > 
> > The DPC driver should be holding a reference to pdev (through some black
> > magic I don't understand), but that would be released when pdev is
> > removed,
> > and I don't know what ensures that dpc_work() runs before that release.
> > 
> 
> Is not that dpc_work() is calling pcie_do_fatal_recovery() which in-turn is
> removing pdev.
> I do not see DPC is using any more reference beyond that.
> So, isnt it safe to assume that no matter when the dpc_work() run, the pdev
> it reference is always intact.
> this is with the fact, that both AER and DPC can not trigger simultaneously,
> and and hence no race for removal.
> who else would remove that pdev other than AER and DPC ?

It's not a problem for the devices below the contained port. It's more
about the port itself, which may happen if you hot-remove the switch
that has DPC services: nothing coordinates that switch's removal with the
dpc_work so the dpc work could be referencing a pdev that pciehp removed
earlier, for example. Removing switches currently is not very common as
far as I know, but there are definitely users that want that ability.

> With Keith's patch, are not we just improving the latency with threaded irq
> instead of workqueue ?

It's a little more than that. The irq subsystem will sync with the
threaded irq before allowing teardown of the device to complete, so that
should ensure we don't have a use-after-free in the dpc bottom half.

> the potential problem is another interrupt arriving while we have
> acknowledged it in dpc_reset_link() and we are about to remove the devices.
> The acknowledgement of DPC status trigger and re-enabling interrupt can be
> done in
> dpc_work() or threaded_handler() after we are done with
> pcie_do_fatal_recovery()
> 
> dpc_reset_link()
> {
>         following can be moved after pcie_do_fatal_recovery() ??
> 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
> 			      PCI_EXP_DPC_STATUS_TRIGGER);
> 
> 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
> 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
> 			      ctl | PCI_EXP_DPC_CTL_INT_EN);
> }
> 
> and probably following also one also can be done after recovery.
> 
> pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
> 			      PCI_EXP_DPC_STATUS_INTERRUPT);
> 
> I am not sure if this was the concern or I did talk about something else all
> together !