[cc += Thomas Tai]

On Thu, Aug 02, 2018 at 10:46:57AM +0200, Lukas Wunner wrote:
> On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> > I am suspecting a possible race condition in the kernel between PCI driver
> > and AER handling.
>
> The solution is to acquire a ref on each device in add_error_device().
> Then release the ref in aer_process_err_devices() by calling pci_dev_put().

So in case it wasn't clear, the below is what I had in mind.
Completely untested though.  Does this work for you?

For v3.10 compatibility, cherry-pick 89ee9f768003 (or alternatively
cherry-pick 8496e85c20e7 and replace pci_dev_is_disconnected(dev)
with !pci_device_is_present(dev)).

-- >8 --
Subject: [PATCH] PCI/AER: Fix use-after-free on surprise removal

The work item to consume errors, aer_isr(), walks the hierarchy using
pci_walk_bus() and stores a pointer to PCI devices which reported an
error in an array.  As long as pci_walk_bus() runs, those pointers are
valid because pci_bus_sem is held.  But once pci_walk_bus() finishes,
nothing prevents the pointers from becoming invalid, e.g. through
unplugging of the PCI devices.

The unprotected pointers are then dereferenced in
aer_process_err_devices(), which may oops:

  #5 general_protection at ffffffff8176cdf2
     [exception RIP: pci_bus_read_config_dword+100]
  #6 pci_find_next_ext_capability at ffffffff81345d7b
  #7 pci_find_ext_capability at ffffffff81347225
  #8 get_device_error_info at ffffffff81356c4d
  #9 aer_isr at ffffffff81357a38

Fix by holding a ref on the devices until they have been processed.
Skip processing of unplugged devices.

Reported-by: gokul cg <gokuljnpr@xxxxxxxxx>
Signed-off-by: Lukas Wunner <lukas@xxxxxxxxx>
---
 drivers/pci/pcie/aer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e8838..937592e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -657,7 +657,7 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
 {
 	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-		e_info->dev[e_info->error_dev_num] = dev;
+		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
 		e_info->error_dev_num++;
 		return 0;
 	}
@@ -898,6 +898,9 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	if (!pos)
 		return 0;
 
+	if (pci_dev_is_disconnected(dev))
+		return 0;
+
 	if (info->severity == AER_CORRECTABLE) {
 		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 			&info->status);
@@ -948,6 +951,7 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (get_device_error_info(e_info->dev[i], e_info))
 			handle_error_source(e_info->dev[i], e_info);
+		pci_dev_put(e_info->dev[i]);
 	}
 }
 
-- 
2.18.0
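
P.S.  For anyone who wants to reason about the lifetime rule outside the
kernel, below is a minimal user-space sketch of the same get/put pattern.
It is only a model, not kernel code: struct fake_dev, struct err_info,
dev_get(), dev_put(), process_err_devices() and MAX_ERR_DEVICES are
hypothetical stand-ins for struct pci_dev, struct aer_err_info,
pci_dev_get(), pci_dev_put(), aer_process_err_devices() and
AER_MAX_MULTI_ERR_DEVICES, and the "freed" flag merely marks the point
where a real object would be released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ERR_DEVICES 5	/* hypothetical stand-in for AER_MAX_MULTI_ERR_DEVICES */

/* Hypothetical user-space stand-in for struct pci_dev. */
struct fake_dev {
	int refcount;		/* number of holders of this object */
	bool disconnected;	/* set when the device is surprise-removed */
	bool freed;		/* set when the last reference is dropped */
};

/* Hypothetical stand-in for struct aer_err_info. */
struct err_info {
	struct fake_dev *dev[MAX_ERR_DEVICES];
	int error_dev_num;
};

/* Take a reference; returns its argument so it can wrap an assignment. */
static struct fake_dev *dev_get(struct fake_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

/* Drop a reference; the object counts as released once none are left. */
static void dev_put(struct fake_dev *d)
{
	if (!d)
		return;
	assert(d->refcount > 0);
	if (--d->refcount == 0)
		d->freed = true;	/* a real object would be released here */
}

/* Walk phase: remember the device and pin it while remembering it. */
static int add_error_device(struct err_info *info, struct fake_dev *d)
{
	if (info->error_dev_num < MAX_ERR_DEVICES) {
		info->dev[info->error_dev_num++] = dev_get(d);
		return 0;
	}
	return -1;
}

/*
 * Processing phase: the stored pointers are safe to dereference because
 * each one still holds a reference.  Unplugged devices are skipped, and
 * every device is unpinned afterwards.  Returns the number handled.
 */
static int process_err_devices(struct err_info *info)
{
	int handled = 0;

	for (int i = 0; i < info->error_dev_num && info->dev[i]; i++) {
		if (!info->dev[i]->disconnected)
			handled++;
		dev_put(info->dev[i]);
	}
	return handled;
}
```

The scenario to try: create two devices with one reference each, add
both with add_error_device(), then mark one as disconnected and drop its
original reference with dev_put() before calling process_err_devices().
In this model the unplugged device stays allocated until the processing
loop drops the reference taken in add_error_device(), which is the
property the patch relies on.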