On Mon, Oct 17, 2011 at 12:08 PM, Yinghai Lu <yinghai.lu@xxxxxxxxxx> wrote: > On 10/17/2011 10:16 AM, Bjorn Helgaas wrote: >> >> On Sat, Oct 15, 2011 at 7:31 PM, Yinghai Lu<yinghai.lu@xxxxxxxxxx> wrote: >>> >>> When hot remove pci express module, got >>> >>> [ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1 >>> [ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt >>> received >>> [ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3) >>> [ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: >>> SLOTCTRL a8 value read 1f9 >>> [ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due >>> to button press. >>> [ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10 >>> [ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink: >>> SLOTCTRL a8 write cmd 200 >>> [ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status: >>> SLOTCTRL a8 write cmd c0 >>> [ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling >>> domain:bus:device=0000:b0:00 >>> [ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: >>> SLOTCTRL a8 value read 2f9 >>> [ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device: >>> domain:bus:dev = 0000:b0:00 >>> [ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6 >>> [ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14 >>> [ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15 >>> [ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16 >>> [ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled >>> [ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at >>> (null) >>> [ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83 >>> [ 5926.133377] PGD 0 >>> [ 5926.135402] Oops: 0000 [#1] SMP >>> [ 5926.138659] CPU 2 >>> [ 5926.140499] Modules linked in: >>> ... >>> [ 5926.143754] >>> [ 5926.275823] Call Trace: >>> [ 5926.278267] [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83 >>> [ 5926.284180] [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba >>> [ 5926.290264] [<ffffffff81366311>] >>> pciehp_unconfigure_device+0x110/0x17b >>> [ 5926.296866] [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188 >>> [ 5926.303123] [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188 >>> [ 5926.309206] [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0 >>> ... >>> >>> Root cause: when doing pci_stop_bus_device() with phys fn, it will stop >>> virt fn and >>> remove the fn, so >>> list_for_each_safe(l, n,&bus->devices) >>> will have problem to refer freed n that is pointed to vf entry. >>> >>> Solution is just call pci_stop_bus_device() with phys fn only. and before >>> that need to >>> save phys fn aside and avoid to use bus->devices to loop. >>> >>> Signed-off-by: Yinghai Lu<yinghai@xxxxxxxxxx> >>> >>> --- >>> drivers/pci/remove.c | 33 +++++++++++++++++++++++++++++++++ >>> 1 file changed, 33 insertions(+) >>> >>> Index: linux-2.6/drivers/pci/remove.c >>> =================================================================== >>> --- linux-2.6.orig/drivers/pci/remove.c >>> +++ linux-2.6/drivers/pci/remove.c >>> @@ -120,10 +120,43 @@ void pci_remove_behind_bridge(struct pci >>> pci_remove_bus_device(pci_dev_b(l)); >>> } >>> >>> +struct dev_list { >>> + struct pci_dev *dev; >>> + struct list_head list; >>> +}; >>> + >>> static void pci_stop_bus_devices(struct pci_bus *bus) >>> { >>> struct list_head *l, *n; >>> + struct dev_list *dl, *dn; >>> + LIST_HEAD(physfn_list); >>> + >>> + /* Save phys_fn aside at first */ >>> + list_for_each(l,&bus->devices) { >>> + struct pci_dev *dev = pci_dev_b(l); >>> + >>> + if (!dev->is_virtfn) { >>> + dl = kmalloc(sizeof(*dl), GFP_KERNEL); >>> + if (!dl) >>> + continue; >>> + dl->dev = dev; >>> + list_add_tail(&dl->list,&physfn_list); >>> + } >>> + } >>> + >>> + /* >>> + * stop bus device for phys_fn at first >>> + * it will stop and remove vf in driver remove action >>> + */ >>> + list_for_each_entry_safe(dl, dn,&physfn_list, list) { >>> + struct pci_dev *dev = dl->dev; >>> + >>> + pci_stop_bus_device(dev); >>> + >>> + kfree(dl); >>> + } >>> >>> + /* Do it again for left over in case */ >>> list_for_each_safe(l, n,&bus->devices) { >>> struct pci_dev *dev = pci_dev_b(l); >>> pci_stop_bus_device(dev); >> >> Apparently it's a problem to stop a VF before its PF? Why is that? >> Sounds like there's an important dependency here, but I can't figure >> it out. I would have naively expected that you would *want* to stop a >> VF first, since it depends on the PF. > > when you stop the VF, that VF will not be removed. > > when you stop the PF, the driver will be unloaded, it will *STOP* the VF at > first and *REMOVE* the VF there. So the bus->devices will be touch at that > time. > for example when you have three VFs. those VFs will be removed from > bus->devices, and n in the list_for_each_safe will have invalid pointer to > freed entry. > > So stop the VF at first will *NOT* help. No need to shout. I wasn't suggesting that stopping VFs first was the right solution. It's just that stopping PFs first violates the general design pattern of "discover top-down and remove bottom-up," so it's not obvious why this is correct. Let me see if I understand this. Here's the path where I think the problem is: pciehp_unconfigure_device pci_remove_bus_device pci_stop_bus_device pci_stop_bus_devices(subordinate) list_for_each(...) <-- A pci_stop_bus_device(PF) pci_stop_dev device_unregister ... driver->remove, e.g., igb_remove pci_disable_sriov sriov_disable virtfn_remove pci_remove_bus_device pci_destroy_dev <remove from bus_list> <-- B The hotplug removal starts with pci_remove_bus_device(), which calls pci_stop_bus_device() on the PF, which ultimately calls the driver->remove() method, which (since it's the PF) calls pci_disable_sriov(), which destroys all the VFs and re-enters pci_remove_bus_device() for each of them. We're manipulating the same list at A and B, so when we get back to A after removing all the VFs, the list has been changed out from under us. Your patch avoids the problem by first putting the PFs on a separate list and walking that instead, so the list at A is now different from the list at B. Right? Maybe this is the best we can do, but it still doesn't seem ideal, and it's certainly not obvious when reading the code. It doesn't seem right for the driver ->remove() method to be calling pci_destroy_dev(). Won't the core data structures be corrupted if a defective driver doesn't call pci_disable_sriov()? Seems like we could end up with a device that's been physically removed, but still has pci_dev structs hanging around. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html