Re: [PATCH V3, 1/1] PCI/AER: fix use-after-free in pcie_do_fatal_recovery

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 07/26/2018 01:18 PM, Bjorn Helgaas wrote:
On Thu, Jul 26, 2018 at 10:29:18AM -0400, Thomas Tai wrote:
[ ... ]>
I know I suggested this strategy, but I think this ended up being more
complicated than it's worth.

The problem code in pcie_do_fatal_recovery() essentially looks like
this:

    pcie_do_fatal_recovery(dev)
      pci_stop_and_remove_bus_device(dev);
      reset_link(dev);
      pci_cleanup_aer_uncorrect_error_status(dev);
      pcie_wait_for_link(dev, ...);
      pci_uevent_ers(dev, ...);
      pci_info(dev, ...);

Some of this depends on the device type (bridge vs. endpoint) and the
caller (AER vs. DPC), but given the right conditions, we can exercise
all the above calls.

I think it is just broken that we keep doing things with "dev" after
removing it.  IMHO this code should be restructured to avoid that.

I think fiddling with the refcount as in this patch adds too much
complexity and makes it look like the current structure of
pcie_do_fatal_recovery() is reasonable when it really isn't.

But restructuring pcie_do_fatal_recovery() is too big a project to do
before v4.18, and we need to fix this problem.  I propose that we
merge your v2 patch for now, so at least the band-aid is in the
function that I think is broken.

I *would* like to reduce the scope of the get/put as in the patch
below, though, so it is contained inside the rescan_remove lock.
Could you try it and make sure it's still enough to avoid the problem?
If it is, I'll add your sign-off and get this in v4.18.

Hi Bjorn,
Thank you for your review and the details analysis. Sure, let's do the work
around for now. I retested your patch below and works fine. You are welcome
to add my signed-off and get this in v4.18.

OK, I added your signed-off-by and put the patch below on my for-linus
branch for v4.18.

Cool. Thank you Bjorn.


As far as reworking the pcie_do_fatal_recovery() goes, would you think I can
help out in any way? May be I can try rework the code to not use the dev
after it is removed.

That'd be great!  I expect Oza and Keith will have useful insight
there, too, so keep them in the loop.

Sure, I will keep Oza and Keith in the loop too.

Thank you,
Thomas



commit 277ce38f2ed6a4310acf3bd541fb3aee4ec27dee
Author: Thomas Tai <thomas.tai@xxxxxxxxxx>
Date:   Tue Jul 24 16:47:59 2018 -0500

      PCI/AER: Work around use-after-free in pcie_do_fatal_recovery()
      When an fatal error is received by a non-bridge device, the device is
      removed, and pci_stop_and_remove_bus_device() deallocates the device
      structure.  The freed device structure is used by subsequent code to send
      uevents and print messages.
      Hold a reference on the device until we're finished using it.  This is not
      an ideal fix because pcie_do_fatal_recovery() should not use the device at
      all after removing it, but that's too big a project for right now.
      #
      [bhelgaas: changelog, reduce get/put coverage]
      Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index fdbcc555860d..674984a9277a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -291,6 +291,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
   	parent = udev->subordinate;
   	pci_lock_rescan_remove();
+	pci_dev_get(dev);
   	list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
   					 bus_list) {
   		pci_dev_get(pdev);
@@ -325,6 +326,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
   		pci_info(dev, "Device recovery from fatal error failed\n");
   	}
+	pci_dev_put(dev);
   	pci_unlock_rescan_remove();
   }

Signed-off-by: Thomas Tai <thomas.tai@xxxxxxxxxx>
---
   drivers/pci/pcie/aer.c | 27 +++++++++++++++++++++++++--
   1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e8838..6e5e6a5 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -657,6 +657,10 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
   static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
   {
   	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+		/* increment reference count to keep the dev
+		 * around until remove_source_device()
+		 */
+		pci_dev_get(dev);
   		e_info->dev[e_info->error_dev_num] = dev;
   		e_info->error_dev_num++;
   		return 0;
@@ -665,6 +669,21 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
   }
   /**
+ * remove_source_device -remove error devices from the e_info
+ * @e_info: pointer to error info
+ */
+static void remove_source_device(struct aer_err_info *e_info)
+{
+	struct pci_dev *dev;
+
+	while (e_info->error_dev_num > 0) {
+		e_info->error_dev_num--;
+		dev = e_info->dev[e_info->error_dev_num];
+		pci_dev_put(dev);
+	}
+}
+
+/**
    * is_error_source - check whether the device is source of reported error
    * @dev: pointer to pci_dev to be checked
    * @e_info: pointer to reported error info
@@ -976,8 +995,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
   			e_info->multi_error_valid = 0;
   		aer_print_port_info(pdev, e_info);
-		if (find_source_device(pdev, e_info))
+		if (find_source_device(pdev, e_info)) {
   			aer_process_err_devices(e_info);
+			remove_source_device(e_info);
+		}
   	}
   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -995,8 +1016,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
   		aer_print_port_info(pdev, e_info);
-		if (find_source_device(pdev, e_info))
+		if (find_source_device(pdev, e_info)) {
   			aer_process_err_devices(e_info);
+			remove_source_device(e_info);
+		}
   	}
   }
--
1.8.3.1




[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux