On 2021/4/28 22:40, Lukas Wunner wrote:
> On Wed, Apr 28, 2021 at 06:08:02PM +0800, Yicong Yang wrote:
>> I've tested the patch on our board, but the hotplug will still be
>> triggered sometimes.
>> seems the hotplug doesn't find the link down event is caused by dpc.
>> Any further test I can do?
>>
>> mestuary:/$ [12508.408576] pcieport 0000:00:10.0: DPC: containment event, status:0x1f21 source:0x0000
>> [12508.423016] pcieport 0000:00:10.0: DPC: unmasked uncorrectable error detected
>> [12508.434277] pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Completer ID)
>> [12508.447651] pcieport 0000:00:10.0: device [19e5:a130] error status/mask=00008000/04400000
>> [12508.458279] pcieport 0000:00:10.0: [15] CmpltAbrt (First)
>> [12508.467094] pcieport 0000:00:10.0: AER: TLP Header: 00000000 00000000 00000000 00000000
>> [12511.152329] pcieport 0000:00:10.0: pciehp: Slot(0): Link Down
>
> Note that about 3 seconds pass between DPC trigger and hotplug link down
> (12508 -> 12511). That's most likely the 3 second timeout in my patch:
>
> +	/*
> +	 * Need a timeout in case DPC never completes due to failure of
> +	 * dpc_wait_rp_inactive().
> +	 */
> +	wait_event_timeout(dpc_completed_waitqueue, dpc_completed(pdev),
> +			   msecs_to_jiffies(3000));
>
> If DPC doesn't recover within 3 seconds, pciehp will consider the
> error unrecoverable and bring down the slot, no matter what.
>
> I can't tell you why DPC is unable to recover. Does it help if you
> raise the timeout to, say, 5000 msec?
>

I raised the timeout to 4s and it works well.
I dumped the remaining jiffies in the log and found that the recovery
sometimes takes a bit more than 3s:

[  826.564141] pcieport 0000:00:10.0: DPC: containment event, status:0x1f01 source:0x0000
[  826.579790] pcieport 0000:00:10.0: DPC: unmasked uncorrectable error detected
[  826.591881] pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Completer ID)
[  826.608137] pcieport 0000:00:10.0: device [19e5:a130] error status/mask=00008000/04400000
[  826.620888] pcieport 0000:00:10.0: [15] CmpltAbrt (First)
[  826.638742] pcieport 0000:00:10.0: AER: TLP Header: 00000000 00000000 00000000 00000000
[  828.955313] pcieport 0000:00:10.0: DPC: dpc_reset_link: begin reset
[  829.719875] pcieport 0000:00:10.0: DPC: DPC reset has been finished.
[  829.731449] pcieport 0000:00:10.0: DPC: remaining time for waiting dpc compelete: 0xd0   <-------- 208 jiffies remained
[  829.732459] ixgbe 0000:01:00.0: enabling device (0000 -> 0002)
[  829.744535] pcieport 0000:00:10.0: pciehp: Slot(0): Link Down/Up ignored (recovered by DPC)
[  829.993188] ixgbe 0000:01:00.1: enabling device (0000 -> 0002)
[  830.760190] pcieport 0000:00:10.0: AER: device recovery successful
[  831.013197] ixgbe 0000:01:00.0 eth0: detected SFP+: 5
[  831.164242] ixgbe 0000:01:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  831.827845] ixgbe 0000:01:00.0 eth0: NIC Link is Down
[  833.381018] ixgbe 0000:01:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

CONFIG_HZ=250, so with the 4s timeout (1000 jiffies) the remaining
jiffies should be larger than 250 if the recovery had finished within 3s.
Here only 208 jiffies remained, i.e. the wait took 792 jiffies, or
about 3.17s.

Is there a reference for the 3s timeout? And does it make sense to raise
it a little bit?

Thanks,
Yicong

> Thanks,
>
> Lukas
>
> .
>