On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> hint") enables ACS, and some platforms lose its NVMe after resume from
> firmware:
> [ 50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> [ 50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> [ 50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [ 50.947830] pcieport 0000:00:1b.0: device [8086:06ac] error status/mask=00200000/00010000
> [ 50.947831] pcieport 0000:00:1b.0: [21] ACSViol (First)
> [ 50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> [ 50.947843] nvme nvme0: frozen state error detected, reset controller
>
> It happens right after ACS gets enabled during resume.
>
> To prevent that from happening, disable AER interrupt and enable it on
> system suspend and resume, respectively.

Lots of questions here. Maybe this is what we'll end up doing, but I
am curious about why the error is reported in the first place.

Is this a consequence of the link going down and back up?

Is it a consequence of the device doing a DMA when it shouldn't?

Are we doing something in the wrong order during suspend? Or maybe
resume, since I assume the error is reported during resume?

If we *do* take the error, why doesn't DPC recovery work?

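For reference, the "error status/mask=00200000/00010000" pair in the log
can be decoded by hand. Below is a minimal sketch in C, assuming the
standard bit layout of the AER Uncorrectable Error Status and Mask
registers. The helper names (uncor_bit_name, lowest_set_bit,
unmasked_errors) are hypothetical and are not kernel functions, and the
table lists only a subset of the defined bits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Subset of bit positions in the AER Uncorrectable Error Status/Mask
 * registers. The labels are descriptive names chosen for this sketch,
 * not the abbreviations the kernel prints.
 */
struct uncor_bit {
	int bit;
	const char *name;
};

static const struct uncor_bit uncor_bits[] = {
	{  4, "Data Link Protocol Error" },
	{  5, "Surprise Down" },
	{ 12, "Poisoned TLP" },
	{ 14, "Completion Timeout" },
	{ 15, "Completer Abort" },
	{ 16, "Unexpected Completion" },
	{ 18, "Malformed TLP" },
	{ 20, "Unsupported Request" },
	{ 21, "ACS Violation" },
};

/* Return the label for a bit position, or NULL if it is not in the table. */
static const char *uncor_bit_name(int bit)
{
	size_t i;

	for (i = 0; i < sizeof(uncor_bits) / sizeof(uncor_bits[0]); i++)
		if (uncor_bits[i].bit == bit)
			return uncor_bits[i].name;
	return NULL;
}

/* Return the index of the lowest set bit, or -1 if no bit is set. */
static int lowest_set_bit(uint32_t v)
{
	int i;

	for (i = 0; i < 32; i++)
		if (v & (UINT32_C(1) << i))
			return i;
	return -1;
}

/* Errors that are logged in the status register and not masked. */
static uint32_t unmasked_errors(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}
```

With the values from the log, unmasked_errors(0x00200000, 0x00010000)
leaves 0x00200000, whose only set bit is 21 (ACS Violation), which
matches the "[21] ACSViol (First)" line. The mask value only covers
bit 16 (Unexpected Completion), so the ACS Violation is not masked.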
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
> ---
>  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 77b0f2c45bc0..0e9a85530ae6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
>  	return 0;
>  }
>
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_disable_rootport(rpc);
> +	return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_enable_rootport(rpc);
> +	return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
>  	.service	= PCIE_PORT_SERVICE_AER,
>
>  	.probe		= aer_probe,
> +	.suspend	= aer_suspend,
> +	.resume		= aer_resume,
>  	.remove		= aer_remove,
>  };
>
> --
> 2.29.2
>