Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending

Lukas Wunner <lukas@xxxxxxxxx> · Sun, 29 Jul 2018 21:07:55 +0200

On Sun, Jul 29, 2018 at 11:30:09AM -0700, Sinan Kaya wrote:
> Yes, slot power needs to be kept on.
> 
> pciehp shouldn't attempt recovery.
> 
> If link goes down due to a DPC event, it should be recovered by DPC status
> trigger. Injecting a cold reset in the middle can cause a HW
> lockup as it is undefined behavior.
> 
> Similarly, if link goes down due to an AER secondary bus reset issue, it
> should be recovered by HW. Injecting a cold reset in the middle of a
> secondary bus reset can cause a HW lockup as it is undefined behavior.

Thanks a lot for the explanation, understood now.

> Maybe, this helps:
> 
> 1. HP ISR observes link down interrupt.
> 2. HP ISR sees that there is a fatal error pending, so it doesn't touch
> the link.
> 3. HP ISR waits until link recovery happens.
> 4. HP ISR calls the read vendor id function.
> 
> DPC link recovery is very quick (100ms at most). Secondary bus reset
> recovery should be contained within 1 second in most cases, but the
> spec allows a device to delay the vendor id read as long as it wants via
> CRS response. We poll up to an additional 60 seconds in read vendor
> id function.

Yes, that proposal makes a lot of sense to me.  This should also work
regardless of whether pciehp or DPC/AER react first to the Link Down.
Could you rebase your patch on the current pci/hotplug branch
and insert the procedure you've outlined above at the top of
pciehp_handle_presence_or_link_change() in pciehp_ctrl.c,
or put it in a helper that's called at the top of that function?

Your patch "[PATCH v6 1/1] PCI: pciehp: Ignore link events when there
is a fatal error pending" only checks once for a pending fatal error;
it should poll until either the fatal error is gone or a timeout is
hit.  If the fatal error is gone and the link is up, you can just return
from pciehp_handle_presence_or_link_change().  Else (in the timeout case)
fall back to the normal handling of a Link Down, i.e. let it bring down
the slot.
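
To illustrate the control flow described above, here is a minimal
user-space sketch. It is not kernel code: struct sim_slot, the sim_*
helpers and the simulated clock are hypothetical stand-ins for the
real AER/DPC status and link state checks, and the timeout and poll
interval are left as parameters because their values are an assumption.

```c
#include <stdbool.h>

/* Hypothetical model of a slot; not a pciehp data structure. */
struct sim_slot {
	int fatal_clears_at_ms;	/* time at which AER/DPC recovery ends, <0: never */
	bool link_up_after;	/* link state once recovery has ended */
	int now_ms;		/* simulated clock */
};

enum sim_action { SIM_IGNORE_EVENT, SIM_BRING_DOWN_SLOT };

/* Stand-in for "is a fatal error still pending?" */
bool sim_fatal_pending(const struct sim_slot *s)
{
	return s->fatal_clears_at_ms < 0 || s->now_ms < s->fatal_clears_at_ms;
}

/* Stand-in for a link status check. */
bool sim_link_up(const struct sim_slot *s)
{
	return !sim_fatal_pending(s) && s->link_up_after;
}

/*
 * Poll until the fatal error is gone or the timeout is hit.
 * If the error is gone and the link is up, the Link Down is ignored.
 * Otherwise fall back to normal handling, i.e. bring down the slot.
 */
enum sim_action sim_handle_link_down(struct sim_slot *s,
				     int timeout_ms, int step_ms)
{
	int waited_ms = 0;

	/* If a fatal error is pending, wait for AER or DPC to handle it. */
	while (sim_fatal_pending(s) && waited_ms < timeout_ms) {
		s->now_ms += step_ms;	/* a real handler would sleep here */
		waited_ms += step_ms;
	}

	if (!sim_fatal_pending(s) && sim_link_up(s))
		return SIM_IGNORE_EVENT;

	return SIM_BRING_DOWN_SLOT;
}
```

Note that a Link Down with no fatal error pending skips the loop
entirely and is handled as before, so the normal hot-removal path is
unchanged in this sketch.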

Please add a code comment in pciehp_handle_presence_or_link_change()
along the lines of

	/* If a fatal error is pending, wait for AER or DPC to handle it. */

The information in your e-mail that a cold reset would incorrectly
interfere with error recovery is a crucial piece of information that
should be included at least in the commit message.  (I was unaware
of that.)

If you have any further questions on pciehp, ask away.

Thanks!

Lukas