Re: PCI: hotplug: Erroneous removal of hotplug PCI devices

On 1/25/2019 2:22 AM, Lukas Wunner wrote:
> 
> On Thu, Jan 24, 2019 at 10:00:49PM +0000, Austin.Bolen@xxxxxxxx wrote:
>> You're welcome!  And yes, this ECN is not yet in 5.0 draft but should be
>> there soon.  Much of what is in the ECN can be implemented with existing
>> DPC hardware.  There are associated ECNs in ACPI and PCI Firmware
>> Specifications as well.  The ACPI spec changes will not be public until
>> the 6.3 spec is released (soon).  The PCI Firmware Specification changes
>> should be ratified in a few weeks as well.  In the meantime, UEFI org
>> (which ACPI is a part of) has a git repo accessible to UEFI forum
>> members where the Linux code to support these new async hot-plug
>> mechanisms is being developed.  Will be made public once the specs are
>> public.
> 
> Unfortunately I don't have access to those ECNs or the non-public
> git repo.
> 

The PCIe Async Hot-Plug ECN has most of the new stuff.  It has been 
released and has the same level of access as the PCIe Base Spec, so if 
you have access to the PCIe Base Spec you should also have access to 
this ECN.

> 
>> Some people have slow hands :-).  This is a U.2 device.  There are 2
>> pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each
>> pin is a different length (On insert, IfDet# mates first, PRSNT# mates
>> second, PCIe data lanes mate 3rd).
> 
> But I've found a public spec for that:
> http://www.ssdformfactor.org/docs/SSD_Form_Factor_Version1_00.pdf
> 
>> These 2 pins determine if it is a SAS or NVMe drive.  We have to wait
>> until both pins mate in order to know for sure it's an NVMe drive and
>> therefore generate PRSNT# to DSP above the drive.  But there is no way
>> to know when both pins have mated due to the way the U.2 connector
>> signals were defined so we have to dead wait.
> 
> What I don't get is, a lot of the pins are used either for PCIe or SAS:
> If the host controller doesn't know yet whether it's PCIe or SAS, why does
> it already start driving them in PCIe mode and bring up the link?
> 
> Also, those PCIe pins mate 3rd, so by the time the host controller starts
> driving those pins in PCIe mode, IfDet# and PRSNT# pins are known to be
> connected (because they mate earlier), so the host controller can stop
> waiting and start sampling the two pins?  Or is it mechanically possible
> to insert the drive in such a way that the pins do not mate before the
> PCIe pins, contrary to the spec?

In this case, there's not a single host controller.  There are two 
things at play.  The sideband signals (IfDet#, PRSNT#) are routed to 
logic on the platform motherboard or drive backplane.  The PCIe lanes 
are routed to a Downstream Port (Root Port or Switch Downstream Port). 
The platform can see when the sideband signals connect but can't see 
when the link trains.  When the link trains and the Data Link Layer 
goes to the Active state, the DSP notifies the OS via a DLLSC interrupt, 
but the platform has no visibility into this (because we grant control 
of hot-plug to the OS natively) and there's no way for the platform to 
be notified when the link becomes active.  The PCIe ECN referenced above 
provides the ability to notify the platform in this scenario but 
requires a hardware change.

Likewise the DSP doesn't know or care if it's possible to plug a SAS 
drive into the slot.  As soon as it starts seeing training sets on the 
PCIe link it's going to train regardless of the state of IfDet#/PRSNT#.

Also, the DSP doesn't know the state of IfDet#/PRSNT# directly.  There's 
a pin called PRSNT# on PCIe DSPs (different from the U.2 Pin 10 PRSNT#). 
The platform will assert PCIe PRSNT# to the DSP when U.2 PRSNT#/IfDet# 
indicate it's an NVMe drive (and the platform doesn't know this until 
around 400 ms after IfDet# mates, for the reasons described earlier). 
But the PCIe Base spec does not require DSPs to see PCIe PRSNT# asserted 
in order to start the training sequence.  That's why the PCIe link comes 
up before the platform knows it's an NVMe drive.

> 
> Link up means in-band presence, 

Link Up and In-band Presence are slightly different states (see Table 
4.15 in the PCIe Base Spec).  In-band presence is a lower-level 
detection mechanism and can be set even when the link is not up.
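
For reference, a minimal sketch of how the two look from the OS side 
(DLLLA lives in Link Status, PDS in Slot Status), assuming "pdev" is the 
Downstream Port's struct pci_dev; illustrative only, not pciehp code:

	#include <linux/pci.h>

	/* Illustrative only: read DLLLA and PDS separately to show they
	 * are distinct bits in distinct registers.
	 */
	static void sample_link_and_presence(struct pci_dev *pdev)
	{
		u16 lnksta, sltsta;

		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta);
		pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &sltsta);

		pci_info(pdev, "DLLLA=%d PDS=%d\n",
			 !!(lnksta & PCI_EXP_LNKSTA_DLLLA),
			 !!(sltsta & PCI_EXP_SLTSTA_PDS));
	}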

> so I assume this host controller already
> adheres to the ECN discussed earlier?  

Yes, this platform disables in-band presence.

> What is the rationale of the ECN
> given that it causes these issues?

In certain link states like L1, DSPs can't detect in-band presence going 
away.  If you remove a device while the link is in one of these states, 
the OS will not be notified, and it will not be able to detect 
subsequent insertions either.  The only way out of this is for the user 
to go into the OS and manually get the link out of L1.
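
For completeness, the knob the ECN adds for this is an "In-Band PD 
Disable" bit in Slot Control, advertised via Slot Capabilities 2.  A 
rough sketch of how an OS could use it, assuming the register defines 
that later kernels added for this ECN (PCI_EXP_SLTCAP2_IBPD and 
PCI_EXP_SLTCTL_IBPD_DISABLE); illustrative only, not a statement of how 
this platform does it:

	#include <linux/pci.h>

	/* Illustrative only: if the Downstream Port advertises In-Band PD
	 * Disable, set it so Presence Detect State reflects out-of-band
	 * (sideband) presence rather than link activity.
	 */
	static void try_disable_inband_pd(struct pci_dev *pdev)
	{
		u32 sltcap2;

		pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP2, &sltcap2);
		if (sltcap2 & PCI_EXP_SLTCAP2_IBPD)
			pcie_capability_set_word(pdev, PCI_EXP_SLTCTL,
						 PCI_EXP_SLTCTL_IBPD_DISABLE);
	}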

The PCIe spec doesn't place any strict requirement on how long the time 
between DLLSC and PDSC can be, but from our measurements 100 ms is not 
long enough to cover the time from the first pins mating until the 
device is fully inserted.  At a platform level we use 1 second to allow 
for insertion and removal events, and this seems adequate for most 
cases, but it can be defeated if someone makes an effort to insert or 
remove a device slowly.
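
As a rough illustration of the kind of settling window meant here (not 
actual platform or pciehp code; the 1 s budget and 100 ms stable window 
are just example values for the sketch):

	#include <linux/delay.h>
	#include <linux/pci.h>

	#define SETTLE_BUDGET_MS	1000	/* give up after 1 s */
	#define SETTLE_STABLE_MS	 100	/* PDS unchanged for 100 ms */

	/* Illustrative only: consider the slot settled once Presence
	 * Detect State has been stable for SETTLE_STABLE_MS.
	 */
	static bool wait_for_stable_presence(struct pci_dev *pdev)
	{
		u16 sltsta, prev;
		int stable = 0, waited = 0;

		pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &prev);
		while (waited < SETTLE_BUDGET_MS && stable < SETTLE_STABLE_MS) {
			msleep(10);
			waited += 10;
			pcie_capability_read_word(pdev, PCI_EXP_SLTSTA, &sltsta);
			if ((sltsta ^ prev) & PCI_EXP_SLTSTA_PDS) {
				stable = 0;
				prev = sltsta;
			} else {
				stable += 10;
			}
		}
		return stable >= SETTLE_STABLE_MS;
	}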

> Is this product being shipped or
> can it still be changed?
> 

It's shipping.  It could be changed with a firmware update to re-enable 
in-band presence, but then we'd have the issue where hot-plug breaks 
when a device is removed while the link is in certain states like L1, so 
that is not a viable solution.

> 
>>> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
>>> and presence detect up, just not 400 ms.
>>
>> Is the hot-inserted device's config space accessed immediately after
>> waiting this 20 + 100 ms delay?
> 
> Yes.
> 
> You can follow the code via this link, click your way down the call stack:
> https://elixir.bootlin.com/linux/v5.0-rc3/source/drivers/pci/hotplug/pciehp_hpc.c#L602
> 
> pciehp_ist()
>    pciehp_handle_presence_or_link_change() /* called on PDC or DLLSC event */
>      pciehp_enable_slot() /* called if PDS or DLLLA is 1 */
>        __pciehp_enable_slot()
>          board_added()
> 	  pciehp_check_link_status()
> 	    pcie_wait_for_link()
> 	      /*
> 	       * wait 20 ms per PCIe r4.0 sec 6.6.1,
> 	       * then wait for up to 1 s until DLLLA is 1,
> 	       * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
> 	       */
> 	    pci_bus_check_dev()
> 	      /*
> 	       * first access to hotplugged device:
> 	       * wait for up to 1 s until vendor ID is accessible
> 	       */
> 	    /*
> 	     * any additional PDC or DLLSC event that occurred up to this point
> 	     * is ignored because the link is up and config space accessible;
> 	     * read Link Status register and verify link is trained
> 	     */
> 	  pciehp_configure_device()
> 	    /* enumerate hotplugged device */
> 
> The 20 ms delay in pcie_wait_for_link() was added by Keith with commit
> f0157160b359 ("PCI: Make link active reporting detection generic")
> so that the function can also be used by the DPC driver.  I think that
> additional delay wouldn't normally be necessary for hotplug but it
> doesn't hurt much either.
> 
> 
>> Per PCIe spec, in Gen 3 mode, software
>> should wait at least (1s - CTO value) after Data Link Layer Link Active
>> before accessing config space.
> 
> Hm, do you have a spec reference for that?  PCIe r4.0 sec 6.7.3.3 only says:
> 
>     "Software must wait for 100 ms after the Data Link Layer Link Active
>      bit reads 1b before initiating a configuration access to the hot
>      added device (see Section 6.6)."
> 

The next sentence says "Software must allow 1 second after the Data Link 
Layer Link Active bit reads 1b before it is permitted to determine that 
a hot plugged device which fails to return a Successful Completion for a 
Valid Configuration Request is a broken device".

And Section 6.6.1 on Conventional Reset says "Software should use 100 ms 
wait periods only if software enables CRS Software Visibility. 
Otherwise, Completion timeouts, platform timeouts, or lengthy processor 
instruction stalls may result."

The "Completion timeouts" reference here is the key. Since CRS Software 
Visibility is not used, the device still has 900 ms before it has to 
return a Successful Completion if you issue a config read 100 ms after 
DLLSC.

CTO timers are typically set around 50 to 65 ms on x86 systems.  So if 
you issue a config read after 100 ms, you could see a CTO around 50 to 
65 ms after it's issued, which would crash the system.  To avoid this 
CTO, software has to wait (1 s - CTO value) after DLLSC before issuing a 
config read at Gen 3 speed (the timing is slightly different for Gen 1/2 
speeds).  This timing constraint needs to be observed any time a device 
is reset as well.
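
To put numbers on it (the 65 ms CTO is the assumed example value from 
above, not something the spec mandates):

	/* Illustrative arithmetic only: the device has up to 1 s after
	 * DLLLA to return a Successful Completion, so without CRS
	 * Software Visibility the first config read must not be issued
	 * until (1 s - CTO) has elapsed, otherwise the CTO timer can
	 * expire while the device is still allowed to stall.
	 */
	#define DEVICE_READY_BUDGET_MS	1000	/* 1 s after DLLLA */
	#define ASSUMED_CTO_MS		  65	/* example x86 CTO value */

	static unsigned int first_config_access_delay_ms(void)
	{
		return DEVICE_READY_BUDGET_MS - ASSUMED_CTO_MS;	/* 935 ms */
	}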

You probably won't see an issue most of the time because most devices 
can come up and respond to config accesses much faster than 100 ms, but 
I've seen devices take longer than that in certain corner cases.  
Regardless of most devices' ability to be ready faster than PCIe 
requires, it's probably a good idea to wait longer on hot-insert anyway: 
insertion events can be slow, so everything needs time to reach steady 
state (fully inserted) before software starts configuring things, and 
100 ms is not enough for that based on our lab measurements.  Likewise, 
DLLSC and PDSC interrupts can be spread pretty far apart on removal, 
depending on how slowly the user removes the device.

> 
>> The exception is if OS enables CRS
>> Software Visibility in which case config space can be accessed 100 ms
>> after Data Link Layer Link Active.  Is CRS Software Visibility being used?
> 
> Not that I'm aware of, no.
> 
> 
>> I forgot... Readiness Notifications can also let software bypass these
>> first access times but I've not seen anybody implement RN so assume it
>> is not being used here.
> 
> Right, I think this isn't used either so far.
> 
> Thanks,
> 
> Lukas
> 




