Re: Should a PCIe Link Down event set the PCI_DEV_DISCONNECTED bit?

On 08/01/2018 03:57 AM, David Laight wrote:
> From: Alex_Gagniuc@xxxxxxxxxxxx
>> Sent: 31 July 2018 17:36
>>
>> On 07/31/2018 04:29 AM, Lukas Wunner wrote:
>>> On Mon, Jul 30, 2018 at 09:38:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
>>>> On 07/28/2018 01:31 PM, Lukas Wunner wrote:
>>>>> On Fri, Jul 27, 2018 at 05:51:04PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
>>>>>> I think PCI_DEV_DISCONNECTED is a documentation issue above all else.
>>>>>> The history I was given is that drivers would take a very long time to
>>>>>> tear down a device. Config space IO to a nonexistent device took a long
>>>>>> while to time out. Performance was one motivation -- and was not
>>>>>> documented.
>>>>>
>>>>> Often it is possible for the driver to detect surprise removal by
>>>>> checking if mmio reads return "all ones".  But in some cases that's
>>>>> a valid value to read from mmio and then this approach won't work.
>>>>> Also, checking every mmio read may negatively impact performance.
>>>>
>>>> A colleague and I beat that dead horse to the afterdeath. Consensus was
>>>> that the return value is less reliable than a coin toss (of a two-heads
>>>> coin).
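
For anyone following along, the pattern Lukas describes is roughly the
sketch below (DEMO_REG_STATUS and demo_device_gone are made-up names,
purely for illustration); the ~0 test on its own is the part that's no
better than the coin:

  #include <linux/io.h>
  #include <linux/pci.h>

  #define DEMO_REG_STATUS 0x10    /* made-up MMIO offset, illustration only */

  /* Sketch only: ~0 from MMIO *might* mean the device is gone. */
  static bool demo_device_gone(struct pci_dev *pdev, void __iomem *bar)
  {
          u32 val = readl(bar + DEMO_REG_STATUS);

          if (val != ~0U)
                  return false;   /* ordinary read, device present */

          /*
           * ~0 can be a perfectly legitimate register value, so
           * confirm by reading the vendor ID from config space.
           */
          return !pci_device_is_present(pdev);
  }
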
> 
> Something cheap-ish to find out whether a -1 was caused by a card
> removal might be sensible - especially if it can be done without
> a config space read.
> Clearly you can't check anything BEFORE doing the read.
> And reading the pci-id from config space isn't entirely useful.
> If the card has reset itself (and the link recovered) then you
> need to read a BAR register and check that it is set up.
> 
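For reference, the BAR check you describe is just one more config read;
a rough sketch (demo_bar0_cleared is a made-up name, and 64-bit BARs
and bus/CPU address translation are ignored for brevity):

  #include <linux/pci.h>

  /*
   * Sketch: a device that dropped and re-trained its link comes back
   * with its BARs cleared, so BAR0 reading as zero on a device we have
   * already configured is a strong hint it reset behind our back.
   * A read of ~0 means the config read itself failed.
   */
  static bool demo_bar0_cleared(struct pci_dev *pdev)
  {
          u32 bar0;

          pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0, &bar0);

          return bar0 != ~0U &&
                 (bar0 & PCI_BASE_ADDRESS_MEM_MASK) == 0;
  }
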
> More interestingly, a read request that is inside the bridge's address
> window but outside any BAR (fairly easy to set up if the target has
> a large BAR and a small one) will also time out (and return -1) even
> though there is no failure of the link.
> 
> If the target supports AER, the information about the failed cycle
> ends up in the target's AER registers - even if the host bridge
> doesn't support AER (or it is being ignored).
> So it might be useful to be able to read the AER registers even when
> no AER interrupt (or other notification) actually happens.

There are a number of ways to know a device is kaput. Information from
AER and DPC has proven to be the most reliable -- so much so that, for the
problems I am trying to solve, it is both necessary and sufficient.
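
For what it's worth, pulling that state straight out of the endpoint's
AER capability is only a couple of config reads, along the lines of the
sketch below (demo_read_aer_status is a made-up name, and of course the
reads themselves come back as ~0 if the link is truly dead):

  #include <linux/pci.h>

  /*
   * Sketch: read the endpoint's AER status registers without waiting
   * for an AER interrupt that may never arrive. Returns false if the
   * device has no AER capability.
   */
  static bool demo_read_aer_status(struct pci_dev *pdev, u32 *uncor, u32 *cor)
  {
          int aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);

          if (!aer)
                  return false;

          pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, uncor);
          pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, cor);
          return true;
  }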

> I've not managed to get Linux to pick up AER interrupts even on
> systems where the hardware clearly supports them (at least on
> some slots).  I suspect the BIOS is carefully disabling them
> because of reports of message logs being spammed with AER errors.

I suspect you've hit a firmware-first (FFS) error handling bug.

> We also have one system (possibly a Dell 740)

Not sure we make a "possibly 740" model. Let me ask around.

> where any failure of a PCIe link leads to an NMI and a kernel crash!

The kernel crash is a Linux bug. I've worked on that extensively in the
past, and we tried to fix it [1]. Unfortunately, after months of spinning
in circles with an unprofessional maintainer, word came down that our
resources were better spent elsewhere. Feel free to pick up where we left
off.

> Not entirely useful in a server model that is supposed to have
> resilience against various errors.

You're preaching to the choir. The architecture and features are driven 
by customer demand. A lot of those "features" -- I haven't asked what 
they are -- are easily implemented with FFS. If you have a problem with 
FFS in particular -- and I do realize a lot of the specs around FFS are 
poorly written and not well thought out -- then it's marketing, sales 
and corporate that should know.

Here's the thing. I think the FW's job is to do the absolute minimum
initialization needed to pass control on to the OS. The U-Boot/Linux
stack executes this beautifully. But customers want features that, very
often, OS vendors are hesitant about or outright refuse to implement.
The only remaining place to implement them is the platform.

It sucks, but it's how things are. Anyway, the patches at [1] should
solve the crash you're seeing on that system.

Alex


[1] https://lore.kernel.org/patchwork/patch/908811/




