On Mon, Jan 23, 2012 at 8:06 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@xxxxxxxxxx> wrote:
>>
>> +       /*
>> +        * pci_stop_bus_device(dev) will not remove dev from bus->devices list,
>> +        * so We don't need use _safe version for_each here.
>> +        * Also _safe version has problem when pci_stop_bus_device() for PF try
>> +        * to remove VFs.
>> +        */
>> +       for (l = head->next; l != head;) {
>
> That's crazy. Why would you open-code this? Why isn't it just a
> "list_for_each()"?

An earlier version did use list_for_each(), but Kenji thought we should
open-code the loop to make it clear that l is updated inside the loop.

>
> And what are the problems with the safe version? If the safe version
> doesn't work, then something is *seriously* wrong with the list.

list_for_each_safe() is defined as

#define list_for_each_safe(pos, n, head) \
        for (pos = (head)->next, n = pos->next; pos != (head); \
                pos = n, n = pos->next)

n is saved before the loop body runs, so "safe" only means that pos can be
removed from the list (and freed) while n can still be used for the next
iteration.

In our case the list holds a PF and several VFs. When pci_stop_bus_device()
is called for the PF, pos is still valid, but the VFs are removed from the
list, so n is no longer valid.

>
>> +               struct pci_dev *dev = pci_dev_b(l);
>> +
>> +               /*
>> +                * VFs are removed by pci_remove_bus_device() in the
>> +                * pci_stop_bus_devices() code path for PF.
>> +                * aka, bus->devices get updated in the process.
>> +                * barrier() will make sure we get right next from that list.
>> +                */
>> +               if (!dev->is_virtfn) {
>> +                       pci_stop_bus_device(dev);
>> +                       barrier();
>> +               }
>
> And this is just insanity. The "barrier()" cannot *possibly* do
> anything sane. If it really makes a difference, there is again some
> serious problem with the whole f*cking thing.
>
> NAK on the patch until sanity is restored. This is just total voodoo
> programming.

Sorry about that.

Could you please check the V1 version?
        https://lkml.org/lkml/2011/10/15/141
or the attached one.

Thanks

Yinghai
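
For illustration only, the failure mode described in the reply above can be
reproduced in userspace. The sketch below is not kernel code: struct node,
list_unlink(), stop_device() and count_stale_visits() are hypothetical
stand-ins for struct list_head, list_del(), pci_stop_bus_device() and the
bus walk. Nodes are unlinked but never freed, so the demo can count how
often the "safe" loop lands on a node that is no longer on the list,
without touching freed memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a device on bus->devices. */
struct node {
	struct node *next, *prev;
	int is_virtfn;	/* 1 for a VF, 0 for a PF */
	int removed;	/* set once the node has been unlinked */
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink without freeing; the node keeps its old next/prev pointers. */
static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->removed = 1;
}

/* Stand-in for stopping a device: stopping a PF unlinks every VF. */
static void stop_device(struct node *dev, struct node *head)
{
	struct node *p, *next;

	if (dev->is_virtfn)
		return;
	for (p = head->next; p != head; p = next) {
		next = p->next;
		if (p->is_virtfn)
			list_unlink(p);
	}
}

/*
 * Walk one PF followed by two VFs using the list_for_each_safe()
 * pattern and count how many visited nodes were already unlinked.
 */
static int count_stale_visits(void)
{
	struct node head = {0};
	struct node dev[3];	/* dev[0] is the PF, dev[1] and dev[2] are VFs */
	struct node *pos, *n;
	int i, stale = 0;

	list_init(&head);
	for (i = 0; i < 3; i++) {
		dev[i].is_virtfn = (i != 0);
		dev[i].removed = 0;
		list_add_tail(&dev[i], &head);
	}
	assert(head.next == &dev[0]);

	for (pos = head.next, n = pos->next; pos != &head;
	     pos = n, n = pos->next) {
		if (pos->removed)
			stale++;	/* n was saved before the VFs went away */
		stop_device(pos, &head);
	}
	return stale;
}
```

In this setup count_stale_visits() returns 2: after the PF is stopped, the
saved n still points at the first VF, and that VF's own next pointer leads
on to the second one. In the kernel those entries have been deleted and
freed, so the same walk follows invalid pointers instead of merely counting
stale nodes.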
From: Yinghai Lu <yinghai@xxxxxxxxx>
Subject: [PATCH 01/10] PCI: Make sriov work with hotplug remove

When hot-removing a PCI Express module that has a PCIe switch and supports
SR-IOV, we got:

[ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1
[ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt received
[ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3)
[ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 1f9
[ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due to button press.
[ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10
[ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
[ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
[ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling domain:bus:device=0000:b0:00
[ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 2f9
[ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device: domain:bus:dev = 0000:b0:00
[ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6
[ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14
[ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15
[ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16
[ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled
[ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83
[ 5926.133377] PGD 0
[ 5926.135402] Oops: 0000 [#1] SMP
[ 5926.138659] CPU 2
[ 5926.140499] Modules linked in: ...
[ 5926.143754]
[ 5926.275823] Call Trace:
[ 5926.278267]  [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83
[ 5926.284180]  [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba
[ 5926.290264]  [<ffffffff81366311>] pciehp_unconfigure_device+0x110/0x17b
[ 5926.296866]  [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188
[ 5926.303123]  [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188
[ 5926.309206]  [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0
...

+-[0000:80]-+-00.0-[81-8f]--
|           +-01.0-[90-9f]--
|           +-02.0-[a0-af]--
|           +-02.2-[b0-bf]----00.0-[b1-b3]--+-02.0-[b2]--+-00.0 Device
|           |                               |            +-00.1 Device
|           |                               |            +-00.2 Device
|           |                               |            \-00.3 Device
|           |                               \-03.0-[b3]--+-00.0 Device
|           |                                            +-00.1 Device
|           |                                            +-00.2 Device
|           |                                            \-00.3 Device

root complex: 80:02.2
pci express modules: have pcie switch and are listed as b0:00.0, b1:02.0
and b1:03.0.
end devices are b2:00.0 and b3:00.0.
VFs are: b2:00.1,... b2:00.3, and b3:00.1,...,b3:00.3

Root cause: when pci_stop_bus_device() is called for a phys fn, it stops
the virt fns and removes them, so list_for_each_safe(l, n, &bus->devices)
ends up referring to a freed n that pointed to a VF entry.

The solution is to call pci_stop_bus_device() for phys fns only. Before
that, the phys fns need to be saved aside so that the loop does not walk
bus->devices.

While reviewing the patch, Bjorn said:
| The PCI hot-remove path calls pci_stop_bus_devices() via
| pci_remove_bus_device().
|
| pci_stop_bus_devices() traverses the bus->devices list (point A below),
| stopping each device in turn, which calls the driver remove() method.  When
| the device is an SR-IOV PF, the driver calls pci_disable_sriov(), which
| also uses pci_remove_bus_device() to remove the VF devices from the
| bus->devices list (point B).
|
|     pci_remove_bus_device
|       pci_stop_bus_device
|         pci_stop_bus_devices(subordinate)
|           list_for_each(bus->devices)             <-- A
|             pci_stop_bus_device(PF)
|               ...
|                 driver->remove
|                   pci_disable_sriov
|                     ...
|                       pci_remove_bus_device(VF)
|                         <remove from bus_list>    <-- B
|
| At B, we're changing the same list we're iterating through at A, so when
| the driver remove() method returns, the pci_stop_bus_devices() iterator has
| a pointer to a list entry that has already been freed.
|
| This patch avoids the problem by building a separate list of all PFs on
| the bus and traversing that at A instead of the bus->devices list.

The discussion thread can be found at: https://lkml.org/lkml/2011/10/15/141

Signed-off-by: Yinghai Lu <yinghai@xxxxxxxxxx>

---
 drivers/pci/remove.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -120,10 +120,43 @@ void pci_remove_behind_bridge(struct pci
 		pci_remove_bus_device(pci_dev_b(l));
 }
 
+struct dev_list {
+	struct pci_dev *dev;
+	struct list_head list;
+};
+
 static void pci_stop_bus_devices(struct pci_bus *bus)
 {
 	struct list_head *l, *n;
+	struct dev_list *dl, *dn;
+	LIST_HEAD(physfn_list);
+
+	/* Save phys_fn aside at first */
+	list_for_each(l, &bus->devices) {
+		struct pci_dev *dev = pci_dev_b(l);
+
+		if (!dev->is_virtfn) {
+			dl = kmalloc(sizeof(*dl), GFP_KERNEL);
+			if (!dl)
+				continue;
+			dl->dev = dev;
+			list_add_tail(&dl->list, &physfn_list);
+		}
+	}
+
+	/*
+	 * stop bus device for phys_fn at first
+	 *  it will stop and remove vf in driver remove action
+	 */
+	list_for_each_entry_safe(dl, dn, &physfn_list, list) {
+		struct pci_dev *dev = dl->dev;
+
+		pci_stop_bus_device(dev);
+
+		kfree(dl);
+	}
 
+	/* Do it again for left over in case */
 	list_for_each_safe(l, n, &bus->devices) {
 		struct pci_dev *dev = pci_dev_b(l);
 		pci_stop_bus_device(dev);