On Fri, Feb 02, 2018 at 03:44:21PM +0100, Stefan Roese wrote:
> On 02.02.2018 14:47, Lukas Wunner wrote:
> >On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> >>>On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >>>>Hotplugging of some PCIe devices on our platform sometimes leads to a
> >>>>bounce of link-up and link-down events, resulting in problems in the
> >>>>corresponding PCI drivers.
> >>>>
> >>>>Here an example of such a hotplug event bounce for a AHCI PCIe card:
> >>>>...
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>
> >>I'm open for other / better ideas on how to solve this situation, we
> >>are seeing on our systems.

This is definitely a real problem that should be fixed somehow.  But I
don't like the idea of a new module parameter because it's not very
user-friendly.  It would be very difficult for a user to identify the
problem, discover the parameter, and figure out what debounce time to
use.

> >If a Link Up event is received and there is already a Link Up / Link Down
> >pair in the queue, the Link Down event can be dequeued and the newly
> >received Link Up event need not be queued.
> >
> >Same if a Link Down event is received and there is already a Link Down /
> >Link Up pair in the queue.
>
> Makes sense. But I'm more often seeing this sequence here while
> hot-plugging the PCIe card:
>
> [ 41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> [ 41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
> [ 41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [ 41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [ 41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> ...
>
> So a link-down is following the link-up directly (~30ms here). Sometimes
> a double link-up is also seen. But this one is more frequent in my test
> cases.

Unfortunately I don't have any easy ideas to offer.  I do think the
pciehp interrupt handling is baroque and I suspect that if we could
simplify and rationalize it, some of these issues would take care of
themselves.

Bjorn
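
For illustration only, the queue-collapsing rule Lukas describes above
(drop a pending Link Down when a Link Up arrives behind a queued
Link Up / Link Down pair, and the mirror case) might look roughly like
the standalone sketch below.  This is not pciehp code: the names
link_event, event_queue, MAX_EVENTS and queue_link_event() are
hypothetical, the queue is a plain array, and the sketch assumes the
pair sits at the tail of the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event type; not the real pciehp definitions. */
enum link_event { LINK_UP, LINK_DOWN };

#define MAX_EVENTS 16

/* Hypothetical pending-event queue, oldest entry first. */
struct event_queue {
	enum link_event ev[MAX_EVENTS];
	size_t len;
};

/*
 * Queue a link event, collapsing a bounce where possible.
 *
 * If the two newest pending events are (ev, opposite of ev), the
 * pending opposite event is dropped and ev itself is not queued,
 * so the queue ends with the single earlier ev.
 *
 * Returns 1 if ev was queued, 0 if the bounce was collapsed,
 * -1 if the queue is full.
 */
int queue_link_event(struct event_queue *q, enum link_event ev)
{
	assert(q != NULL);

	if (q->len >= 2 &&
	    q->ev[q->len - 2] == ev &&
	    q->ev[q->len - 1] != ev) {
		q->len--;	/* dequeue the pending opposite event */
		return 0;
	}

	if (q->len == MAX_EVENTS)
		return -1;

	q->ev[q->len++] = ev;
	return 1;
}
```

With Link Up and Link Down already pending, a further Link Up leaves
only the first Link Up in the queue.  Note that this only looks at link
events; the more frequent sequence Stefan reports also interleaves
"Card present" / "Card not present" events, which this sketch does not
attempt to handle.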