Re: [PATCH] PCI: pciehp: Differentiate between surprise and safe removal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 02, 2018 at 12:59:18PM +0530, gokul cg wrote:
> I am suspecting a possible race condition in the kernel between PCI driver
> and AER handling.
> 
> Because of the same kernel panic happens from worker thread which handles
> bottom half of aer irq.
> 
> I am seeing this issue when I suddenly power off PCI card which
> supports/enabled PCIE AER error reporting.
> 
> While powering off PCI device, AER driver will get AER IRQ for the device,
> from AER IRQ handler, it will cache AER error code and schedule worker
> thread to handle error.
> 
> The PCIe device will get removed from PCI tree before worker thread
> completes its task and kernel panic is  happening when worker thread tries
> to access PCI device's config space.
> 
> #5 [ffff88027469fc70] general_protection at ffffffff8176cdf2
>     [exception RIP: pci_bus_read_config_dword+100]
> #6 [ffff88027469fd50] pci_find_next_ext_capability at ffffffff81345d7b
> #7 [ffff88027469fd90] pci_find_ext_capability at ffffffff81347225
> #8 [ffff88027469fda0] get_device_error_info at ffffffff81356c4d
> #9 [ffff88027469fdd0] aer_isr at ffffffff81357a38
> #10 [ffff88027469fe28] process_one_work at ffffffff8105d4c0
> #11 [ffff88027469fe70] worker_thread at ffffffff8105e251
> #12 [ffff88027469fed0] kthread at ffffffff81064260
> #13 [ffff88027469ff50] ret_from_fork at ffffffff81773a38
> 
> I have tested it on kernel 3.10 . But from source i could see that this
> case is still relevant for latest Linux source .

I'm not really familiar with the AER driver, but the problem is
actually easy to spot:

find_source_device() walks the hierarchy and saves a pointer to
pci_dev's in an array.  That array is later traversed and the
pci_dev's are accessed.

The solution is to acquire a ref on each device in add_error_device():

-	e_info->dev[e_info->error_dev_num] = dev;
+	e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);

Then release the ref aer_process_err_devices() by calling pci_dev_put().

I believe there's an ongoing refactoring of the AER driver and the
issue may be addressed in the course of it, but as a quick fix for
an ancient v3.10 kernel, the above should do the trick.

HTH,

Lukas



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux