Re: [PATCH V5 3/3] PCI: Mask and unmask hotplug interrupts during reset

Sinan Kaya <okaya@xxxxxxxxxx> · Fri, 20 Jul 2018 19:58:20 -0700

On 7/20/2018 1:01 PM, Bjorn Helgaas wrote:
On Tue, Jul 10, 2018 at 02:30:11PM -0400, Sinan Kaya wrote:
On Mon, Jul 9, 2018 at 12:00 PM, Lukas Wunner <lukas@xxxxxxxxx> wrote:
On Mon, Jul 09, 2018 at 08:48:44AM -0600, Sinan Kaya wrote:
On 7/8/18, Lukas Wunner <lukas@xxxxxxxxx> wrote:
On Tue, Jul 03, 2018 at 11:43:26AM -0400, Sinan Kaya wrote:
My solution doesn't help if the link down interrupt is observed
before the AER or DPC services run.

If pciehp gets an interrupt quicker than dpc/aer, it will (at
least with my patches) remove all devices, check if the
presence bit is set, and if so, try to bring the slot up
again.

Hotplug driver should only observe a link down interrupt. Link
would come up in response to a secondary bus reset initiated by
the AER driver.

PCIe hotplug doesn't have separate Link Down and Link Up
interrupts, there is only a Link State *Changed* event.

Can you point me to the code that would bring up the link in hp
code?

I was referring to the situation with my recently posted pciehp
patches applied, in particular patch [21/32] ("PCI: pciehp: Become
resilient to missed events"):
https://patchwork.ozlabs.org/patch/930389/

When I get a presence or link changed event, I turn the slot off.
That includes removing all devices in the slot.  Because even if
the slot is still occupied or link is up, there was definitely a
change and the safe behavior is to assume that the card in the
slot is now a different one than before.
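
A minimal user-space sketch of the behavior described above, for
illustration only. It is not code from the pciehp patches: the struct,
enum and function names are hypothetical, and the "presence bit" is
passed in as a plain boolean instead of being read from Slot Status.

```c
#include <stdbool.h>

enum sim_slot_power { SIM_SLOT_OFF, SIM_SLOT_ON };

/* Hypothetical model of one hotplug slot. */
struct sim_slot {
	enum sim_slot_power state;
	int removals;   /* times all devices in the slot were removed */
	int bringups;   /* times the slot was brought up again */
};

/*
 * React to a presence or link changed event: always turn the slot off
 * first (the card may be a different one now), then bring it back up
 * only if a card is still detected.
 */
static void sim_handle_change_event(struct sim_slot *s, bool presence_detected)
{
	if (s->state == SIM_SLOT_ON) {
		s->removals++;
		s->state = SIM_SLOT_OFF;
	}
	if (presence_detected) {
		s->bringups++;
		s->state = SIM_SLOT_ON;
	}
}
```

In this model, an event on an occupied, running slot always costs one
removal and one bring-up, even when the card never left the slot.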

We do have a bit of a mess, unfortunately. The error handling and
hotplug drivers do not play nicely with each other.

When the hotplug driver observes a link down, we do not check whether
the link went down because the user really wanted to remove a card or
because an error handling service such as AER/DPC caused it.

I'm thinking that we could potentially check if a hotplug event is
pending at the entrance of fatal error handling. If it is pending,
we could poll until the status bit clears. That should flush the
link down event.
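
A rough sketch of that polling idea, as a user-space simulation rather
than kernel code. The struct and function names are hypothetical, the
poll budget is an assumption (real code would need a proper timeout),
and only the DLLSC bit value is taken from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIM_SLTSTA_DLLSC 0x0100 /* Data Link Layer State Changed */

/* Hypothetical stand-in for a port's Slot Status register. */
struct sim_status_port {
	uint16_t sltsta;
	int polls_until_handled; /* simulates pciehp clearing the bit later */
};

/* Simulated register read; the pending bit clears after N reads. */
static uint16_t sim_read_sltsta(struct sim_status_port *p)
{
	if (p->polls_until_handled > 0 && --p->polls_until_handled == 0)
		p->sltsta &= (uint16_t)~SIM_SLTSTA_DLLSC;
	return p->sltsta;
}

/*
 * Called at the entrance of fatal error handling. Returns true once no
 * link event is pending, false if the bit is still set after max_polls.
 */
static bool sim_wait_for_hotplug_idle(struct sim_status_port *p, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (!(sim_read_sltsta(p) & SIM_SLTSTA_DLLSC))
			return true;
	}
	return false;
}
```

The false return matters: if the hotplug driver never clears the bit,
the error handler needs some way to give up instead of spinning.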

Even then, a link down indication in the hotplug driver seems to turn
off slot power and the LED.

If the AER/DPC service runs after the hotplug driver, the link won't
come back up because power to the slot has been turned off.

I'd like to hear about Bjorn's opinion before we throw something
else into this problem.

You guys know way more about this than I do.

I think the separation of AER/DPC/pciehp into separate drivers is
somewhat artificial because there are many interdependencies.  The
driver model doesn't apply very well because there's only one
underlying piece of hardware, which forces us to use the portdrv as
sort of a multiplexer.  The fact that portdrv claims these bridges
also means normal drivers (e.g., for performance counters) can't use
the usual model.

All that is to say that if integrating these services more tightly
would help solve this problem, I'd be open to that.

I was looking at how to get rid of the portdrv for a while. It looks
like a much bigger task, to be honest. There are multiple levels of
abstraction in the code, as you highlighted.

My patch solves the problem if the AER interrupt happens before the
hotplug interrupt. We are masking the Data Link Layer State Changed
interrupt, so AER/DPC can perform their link operations without racing
against the hotplug driver.
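
To illustrate the masking idea, here is a small user-space simulation.
It is not the code from the patch: the struct and helpers are
hypothetical, the reset is faked, and discarding the latched status
bit before unmasking is an assumption of this sketch. Only the two bit
values come from pci_regs.h.

```c
#include <stdint.h>

#define SIM_SLTCTL_DLLSCE 0x1000 /* Data Link Layer State Changed Enable */
#define SIM_SLTSTA_DLLSC  0x0100 /* Data Link Layer State Changed */

/* Hypothetical simulated port; real code would use config accessors. */
struct sim_port {
	uint16_t sltctl;
	uint16_t sltsta;
	int link_irqs; /* link events delivered to the hotplug handler */
};

/* Simulated secondary bus reset: the link drops and retrains. */
static void sim_bus_reset(struct sim_port *p)
{
	p->sltsta |= SIM_SLTSTA_DLLSC;
	if (p->sltctl & SIM_SLTCTL_DLLSCE)
		p->link_irqs++;
}

/* Run the reset with the link state changed interrupt masked. */
static void sim_reset_with_link_events_masked(struct sim_port *p)
{
	uint16_t saved = p->sltctl & SIM_SLTCTL_DLLSCE;

	p->sltctl &= (uint16_t)~SIM_SLTCTL_DLLSCE;
	sim_bus_reset(p);
	/* Assumption: drop the event the reset itself latched. */
	p->sltsta &= (uint16_t)~SIM_SLTSTA_DLLSC;
	p->sltctl |= saved; /* restore only what was enabled before */
}
```

Saving and restoring the previous enable bit, rather than setting it
unconditionally, keeps ports that never had the interrupt enabled
unchanged.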

We need to figure out how to return gracefully from the hotplug driver
if a link down happened and there is an error pending.

My first question is why the hotplug driver reacts to the link event
when there was no actual device insertion/removal.

Would it help to keep track of presence changed interrupts since last
link event?

For example, if the counter is 0 and the device is present, the
hotplug driver bails out silently.
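
A minimal sketch of that counter heuristic, assuming the counter is
reset on every link event. All names are hypothetical and nothing here
exists in pciehp; locking and interrupt context are ignored.

```c
#include <stdbool.h>

enum sim_link_action {
	SIM_IGNORE_LINK_EVENT, /* bail out silently */
	SIM_HANDLE_AS_HOTPLUG  /* treat as a real insertion/removal */
};

/* Hypothetical per-slot bookkeeping. */
struct sim_slot_tracker {
	unsigned int presence_changes; /* since the last link event */
};

/* Called for each presence changed interrupt. */
static void sim_note_presence_change(struct sim_slot_tracker *t)
{
	t->presence_changes++;
}

/* Called for each link state changed event. */
static enum sim_link_action sim_on_link_event(struct sim_slot_tracker *t,
					      bool device_present)
{
	unsigned int changes = t->presence_changes;

	t->presence_changes = 0; /* start a new counting window */
	if (changes == 0 && device_present)
		return SIM_IGNORE_LINK_EVENT;
	return SIM_HANDLE_AS_HOTPLUG;
}
```

With this rule, a link toggle caused by an AER/DPC reset on an occupied
slot is ignored, while a link event that follows any presence change,
or that finds the slot empty, still goes through the hotplug path.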

Bjorn