Hi Guenter, On Mon, Oct 12, 2015 at 12:10:13PM -0700, Guenter Roeck wrote: > Some oddball devices may experience a PCIe link flap after power-on. > This may result in the following sequence of events. > > fpc0 kernel: pciehp 0000:02:08.0:pcie24: Card present on Slot(0) > fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event > fpc0 kernel: pciehp 0000:02:08.0:pcie24: Link Up event ignored on slot(0): already powering on > fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Down event > fpc0 kernel: pciehp 0000:02:08.0:pcie24: Link Down event queued on slot(0): currently getting powered on > fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event > fpc0 kernel: pciehp 0000:02:08.0:pcie24: Link Up event queued on slot(0): currently getting powered off I'm really interested in how this happens. I'm not happy with the pciehp state machine, and I wonder if it is causing or obscuring this problem. For example, pcie_isr() reads PCI_EXP_SLTSTA to find out what happened. Then it queues events (attention button, presence detect changed, etc.) Along the way, we re-read PCI_EXP_SLTSTA. That seems wrong to me -- I think we should capture the state *once*, save it, and act on it. If we re-read it, we'll see transitions that might confuse us. Similarly, pcie_isr() reads and acts on PCI_EXP_LNKSTA (via pciehp_check_link_active()). I think we should capture and save that at the same time we capture PCI_EXP_SLTSTA, before we queue up work events that are going to change things. And I'm not sure it's really necessary for pcie_isr() to queue up *separate* events for attention button, presence detect, power fault, and link state events. I suspect that could throw in extraneous events that confuse things. For instance, I think it's possible for pcie_isr() to see a single interrupt with PCI_EXP_SLTSTA showing both a presence detect change (card now present) and a link state change (link now up). It will queue two events and we'll probably see a "Link Up event ignored" message. I think it would be better if pcie_isr() captured the register values we need, bundled them up, and queued a single work item to deal with them. > This causes the driver for affected devices to be instantiated, removed, > and re-instantiated. This can result in problems, for example if the device > in question provides gpio pins or interrupts which are used by other > drivers. For example, the removal can result in errors such as > the following. > > fpc0 kernel: remove_proc_entry: removing non-empty directory 'irq/148', > leaking at least 'pic0' > fpc0 kernel: ------------[ cut here ]------------ > fpc0 kernel: WARNING: at /home/p2020/linux-freescale/fs/proc/generic.c:575 Something's definitely wrong here, but shouldn't we be able to add and remove a device as many times as we want, as quickly as we want, without problems? Maybe this particular issue is a driver problem or a core problem with the proc file maintenance? I wonder if this is something we could reproduce without pciehp at all, just by inserting and removing a module? > Add support for per-port power-on delay to avoid this situation. This can > be enabled globally with the pciehp_poweron_delay module parameter, or > per port (using a quirks function) with a new poweron_delay flag in > struct pci_dev. > > With this patch, the link flap still occurs, but because of the delayed > insertion the driver is not immediately instantiated, and the above error > is no longer seen. We might have to do something like this eventually, but I'd really like to see if we can simplify the pciehp state machine a little before we add more stuff to it. > Signed-off-by: Guenter Roeck <linux@xxxxxxxxxxxx> > --- > We started seeing this problem after the recent rework of link status > change handling. I think the up/down/up sequence was previously just > ignored or folded into a single event (and sometimes resulted in a stuck > state machine). > I would like to see this patch upstream, but I am not sure if the problem > is seen anywhere but on the hardware I am dealing with, and I can > understand if others don't want to pollute the pcie hotplug code with > such workarounds. Also, maybe someone has a better idea on how to handle > the situation, so I marked the patch as RFC. > > drivers/pci/hotplug/pciehp.h | 5 ++++- > drivers/pci/hotplug/pciehp_core.c | 3 +++ > drivers/pci/hotplug/pciehp_ctrl.c | 44 +++++++++++++++++++++++++++------------ > drivers/pci/hotplug/pciehp_hpc.c | 2 ++ > include/linux/pci.h | 1 + > 5 files changed, 41 insertions(+), 14 deletions(-) > > diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h > index 62d6fe6c3714..5047bde7d51d 100644 > --- a/drivers/pci/hotplug/pciehp.h > +++ b/drivers/pci/hotplug/pciehp.h > @@ -40,6 +40,7 @@ > > #define MY_NAME "pciehp" > > +extern bool pciehp_poweron_delay; > extern bool pciehp_poll_mode; > extern int pciehp_poll_time; > extern bool pciehp_debug; > @@ -98,6 +99,7 @@ struct controller { > unsigned int cmd_busy:1; > unsigned int link_active_reporting:1; > unsigned int notification_enabled:1; > + unsigned int poweron_delay:1; /* Delay poweron for this slot */ > unsigned int power_fault_detected; > }; > > @@ -112,7 +114,8 @@ struct controller { > #define BLINKINGON_STATE 1 > #define BLINKINGOFF_STATE 2 > #define POWERON_STATE 3 > -#define POWEROFF_STATE 4 > +#define DELAYED_POWERON_STATE 4 > +#define POWEROFF_STATE 5 > > #define ATTN_BUTTN(ctrl) ((ctrl)->slot_cap & PCI_EXP_SLTCAP_ABP) > #define POWER_CTRL(ctrl) ((ctrl)->slot_cap & PCI_EXP_SLTCAP_PCP) > diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c > index 612b21a14df5..cc69fd10d884 100644 > --- a/drivers/pci/hotplug/pciehp_core.c > +++ b/drivers/pci/hotplug/pciehp_core.c > @@ -38,6 +38,7 @@ > #include <linux/time.h> > > /* Global variables */ > +bool pciehp_poweron_delay; > bool pciehp_debug; > bool pciehp_poll_mode; > int pciehp_poll_time; > @@ -51,10 +52,12 @@ MODULE_AUTHOR(DRIVER_AUTHOR); > MODULE_DESCRIPTION(DRIVER_DESC); > MODULE_LICENSE("GPL"); > > +module_param(pciehp_poweron_delay, bool, 0644); > module_param(pciehp_debug, bool, 0644); > module_param(pciehp_poll_mode, bool, 0644); > module_param(pciehp_poll_time, int, 0644); > module_param(pciehp_force, bool, 0644); > +MODULE_PARM_DESC(pciehp_poweron_delay, "Delay port power-on"); > MODULE_PARM_DESC(pciehp_debug, "Debugging mode enabled or not"); > MODULE_PARM_DESC(pciehp_poll_mode, "Using polling mechanism for hot-plug events or not"); > MODULE_PARM_DESC(pciehp_poll_time, "Polling mechanism frequency, in seconds"); > diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c > index ef96a1d51fac..fd829c2b7418 100644 > --- a/drivers/pci/hotplug/pciehp_ctrl.c > +++ b/drivers/pci/hotplug/pciehp_ctrl.c > @@ -204,11 +204,19 @@ static void pciehp_power_thread(struct work_struct *work) > kfree(info); > } > > -void pciehp_queue_power_work(struct slot *p_slot, int req) > +void pciehp_queue_power_work(struct slot *p_slot, int req, bool delay) > { > struct power_work_info *info; > > - p_slot->state = req == ENABLE_REQ ? POWERON_STATE : POWEROFF_STATE; > + if (req == ENABLE_REQ) { > + p_slot->state = delay ? DELAYED_POWERON_STATE : POWERON_STATE; > + if (delay) { > + mod_delayed_work(p_slot->wq, &p_slot->work, HZ); > + return; > + } > + } else { > + p_slot->state = POWEROFF_STATE; > + } > > info = kmalloc(sizeof(*info), GFP_KERNEL); > if (!info) { > @@ -229,10 +237,11 @@ void pciehp_queue_pushbutton_work(struct work_struct *work) > mutex_lock(&p_slot->lock); > switch (p_slot->state) { > case BLINKINGOFF_STATE: > - pciehp_queue_power_work(p_slot, DISABLE_REQ); > + pciehp_queue_power_work(p_slot, DISABLE_REQ, false); > break; > case BLINKINGON_STATE: > - pciehp_queue_power_work(p_slot, ENABLE_REQ); > + case DELAYED_POWERON_STATE: > + pciehp_queue_power_work(p_slot, ENABLE_REQ, false); > break; > default: > break; > @@ -263,7 +272,7 @@ static void handle_button_press_event(struct slot *p_slot) > /* blink green LED and turn off amber */ > pciehp_green_led_blink(p_slot); > pciehp_set_attention_status(p_slot, 0); > - queue_delayed_work(p_slot->wq, &p_slot->work, 5*HZ); > + mod_delayed_work(p_slot->wq, &p_slot->work, 5 * HZ); > break; > case BLINKINGOFF_STATE: > case BLINKINGON_STATE: > @@ -285,6 +294,7 @@ static void handle_button_press_event(struct slot *p_slot) > break; > case POWEROFF_STATE: > case POWERON_STATE: > + case DELAYED_POWERON_STATE: > /* > * Ignore if the slot is on power-on or power-off state; > * this means that the previous attention button action > @@ -305,12 +315,12 @@ static void handle_surprise_event(struct slot *p_slot) > { > u8 getstatus; > > + if (p_slot->state == DELAYED_POWERON_STATE) > + cancel_delayed_work(&p_slot->work); > + > pciehp_get_adapter_status(p_slot, &getstatus); > - if (!getstatus) { > - pciehp_queue_power_work(p_slot, DISABLE_REQ); > - } else { > - pciehp_queue_power_work(p_slot, ENABLE_REQ); > - } > + pciehp_queue_power_work(p_slot, getstatus ? ENABLE_REQ : DISABLE_REQ, > + p_slot->ctrl->poweron_delay); > } > > /* > @@ -327,8 +337,13 @@ static void handle_link_event(struct slot *p_slot, u32 event) > /* Fall through */ > case STATIC_STATE: > pciehp_queue_power_work(p_slot, event == INT_LINK_UP ? > - ENABLE_REQ : DISABLE_REQ); > + ENABLE_REQ : DISABLE_REQ, > + ctrl->poweron_delay); > break; > + case DELAYED_POWERON_STATE: > + if (event != INT_LINK_UP) > + cancel_delayed_work(&p_slot->work); > + /* Fall through */ > case POWERON_STATE: > if (event == INT_LINK_UP) { > ctrl_info(ctrl, > @@ -338,7 +353,7 @@ static void handle_link_event(struct slot *p_slot, u32 event) > ctrl_info(ctrl, > "Link Down event queued on slot(%s): currently getting powered on\n", > slot_name(p_slot)); > - pciehp_queue_power_work(p_slot, DISABLE_REQ); > + pciehp_queue_power_work(p_slot, DISABLE_REQ, false); > } > break; > case POWEROFF_STATE: > @@ -346,7 +361,8 @@ static void handle_link_event(struct slot *p_slot, u32 event) > ctrl_info(ctrl, > "Link Up event queued on slot(%s): currently getting powered off\n", > slot_name(p_slot)); > - pciehp_queue_power_work(p_slot, ENABLE_REQ); > + pciehp_queue_power_work(p_slot, ENABLE_REQ, > + ctrl->poweron_delay); > } else { > ctrl_info(ctrl, > "Link Down event ignored on slot(%s): already powering off\n", > @@ -482,6 +498,7 @@ int pciehp_sysfs_enable_slot(struct slot *p_slot) > p_slot->state = STATIC_STATE; > break; > case POWERON_STATE: > + case DELAYED_POWERON_STATE: > ctrl_info(ctrl, "Slot %s is already in powering on state\n", > slot_name(p_slot)); > break; > @@ -522,6 +539,7 @@ int pciehp_sysfs_disable_slot(struct slot *p_slot) > break; > case BLINKINGON_STATE: > case POWERON_STATE: > + case DELAYED_POWERON_STATE: > ctrl_info(ctrl, "Already disabled on slot %s\n", > slot_name(p_slot)); > break; > diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c > index 5c24e938042f..7c7a598eec9f 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -809,6 +809,8 @@ struct controller *pcie_init(struct pcie_device *dev) > pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &link_cap); > if (link_cap & PCI_EXP_LNKCAP_DLLLARC) > ctrl->link_active_reporting = 1; > + if (pciehp_poweron_delay || dev->port->poweron_delay) > + ctrl->poweron_delay = 1; > > /* Clear all remaining event bits in Slot Status register */ > pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, > diff --git a/include/linux/pci.h b/include/linux/pci.h > index e90eb22de628..9cb0fe3037b9 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -359,6 +359,7 @@ struct pci_dev { > unsigned int io_window_1k:1; /* Intel P2P bridge 1K I/O windows */ > unsigned int irq_managed:1; > unsigned int has_secondary_link:1; > + unsigned int poweron_delay:1; /* Port needs poweron delay */ > pci_dev_flags_t dev_flags; > atomic_t enable_cnt; /* pci_enable_device has been called */ > > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html