On Thu, Jan 11, 2018 at 9:59 AM, Keith Busch <keith.busch@xxxxxxxxx> wrote:
> On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
>> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
>> patches from Keith:
>>
>> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
>> [PATCH 2/4] PCI/AER: Provide API for getting AER information
>> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
>> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
>>
>> The issue is still the same. Additionally to the output before I see now:
>>
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: device [8086:19aa] error status/mask=00000020/00000000
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: [ 5] Surprise Down Error (First)
>> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0
>
> Okay, so that series wasn't going to fix anything, but at least it gets
> some visibility into what's happened. The DPC was triggered due to a
> Surprise Down uncorrectable error, so the power setting is causing the
> link to fail.
>
> The NVMe driver has quirks specifically for this vendor's devices to
> fence off NVMe specific automated power settings. Your observations
> appear to align with the same issues.

Agree.

	/*
	 * Samsung SSD 960 EVO drops off the PCIe bus after system
	 * suspend on a Ryzen board, ASUS PRIME B350M-A.
	 */
	if (dmi_match(DMI_BOARD_VENDOR, "ASUSTeK COMPUTER INC.") &&
	    dmi_match(DMI_BOARD_NAME, "PRIME B350M-A"))
		return NVME_QUIRK_NO_APST;

It seems that the attempt to save extra power using ASPM L1 substates is
causing the device to fall off the bus.

Sorry, but I suspect this may be difficult to debug without a PCIe
analyzer. Some debugging directions:

- Assuming this is a hotpluggable device, try another NVMe drive to
  verify whether the issue is specific to this device.

- Can you please try switching the ASPM policy back from
  "powersupersave" to "powersave", possibly followed by a rescan
  (echo 1 > /sys/bus/pci/rescan), and see if the device comes back (and
  goes away again when you switch back to "powersupersave")?

- Maybe put some debug prints in pcie_config_aspm_l1ss() to see which
  register write causes the device to fall off (most likely the last
  one, but I'm just throwing out ideas).

- Maybe dump the timing parameters link->l1ss.ctl1 and link->l1ss.ctl2
  from aspm_calc_l1ss_info(), and try to play with them a little.
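
For reference, here is a rough user-space sketch of how the two status words in the log quoted above decode (DPC status 0x1f09 and AER uncorrectable status 0x00000020). The bit positions follow the DPC and AER register layouts as I understand them from the kernel's pci_regs.h. The function names are made up for illustration and are not kernel APIs, and the name table is deliberately partial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DPC Status register fields, bit layout as in the kernel's pci_regs.h. */
#define DPC_STATUS_TRIGGER      0x0001u  /* bit 0: containment triggered */
#define DPC_STATUS_TRIGGER_RSN  0x0006u  /* bits 2:1: trigger reason */

/* Hypothetical helper: 1 if the status word reports a DPC trigger. */
int dpc_triggered(uint16_t status)
{
	return (status & DPC_STATUS_TRIGGER) != 0;
}

/*
 * Hypothetical helper: extract the 2-bit trigger reason.
 * 0 = unmasked uncorrectable error, 1 = ERR_NONFATAL received,
 * 2 = ERR_FATAL received, 3 = see the trigger reason extension field.
 */
unsigned int dpc_trigger_reason(uint16_t status)
{
	return (status & DPC_STATUS_TRIGGER_RSN) >> 1;
}

/*
 * Hypothetical helper: index of the lowest set bit, or -1 if none.
 * Note this is not the hardware "First Error Pointer"; it only
 * coincides with it when a single bit is set, as in the log above.
 */
int aer_lowest_error_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Hypothetical helper: name of an AER uncorrectable status bit (partial). */
const char *aer_uncor_bit_name(int bit)
{
	assert(bit >= 0 && bit < 32);
	switch (bit) {
	case 4:  return "Data Link Protocol Error";
	case 5:  return "Surprise Down Error";
	case 14: return "Completion Timeout";
	case 20: return "Unsupported Request";
	default: return "other (not listed in this sketch)";
	}
}
```

With the values from the log, dpc_trigger_reason(0x1f09) gives 0 (unmasked uncorrectable error) and aer_lowest_error_bit(0x00000020) gives 5, which matches the "[ 5] Surprise Down Error" line.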
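
If you do dump ctl1 and ctl2, something along these lines may help turn the raw values into timings. This is only a sketch in plain user-space C, assuming the L1 PM Substates Control 1/Control 2 field layout from the PCIe spec as I remember it. The struct and function names are invented for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* L1 PM Substates Control 1 fields (assumed layout, see note above). */
#define L1SS_CTL1_ENABLE_MASK     0x0000000fu /* PCI-PM L1.2/L1.1, ASPM L1.2/L1.1 */
#define L1SS_CTL1_CM_RESTORE_TIME 0x0000ff00u /* Common_Mode_Restore_Time, in us */
#define L1SS_CTL1_LTR_TH_VALUE    0x03ff0000u /* LTR_L1.2_THRESHOLD value */
#define L1SS_CTL1_LTR_TH_SCALE    0xe0000000u /* LTR_L1.2_THRESHOLD scale */

/* L1 PM Substates Control 2 fields. */
#define L1SS_CTL2_TPOWER_SCALE    0x00000003u /* T_POWER_ON scale */
#define L1SS_CTL2_TPOWER_VALUE    0x000000f8u /* T_POWER_ON value */

/* Hypothetical container for the decoded values. */
struct l1ss_timing {
	unsigned int enables;       /* raw enable bits 3:0 */
	unsigned int cm_restore_us; /* common mode restore time */
	uint64_t ltr_threshold_ns;  /* LTR L1.2 threshold */
	unsigned int t_power_on_us; /* T_POWER_ON */
};

/* Hypothetical helper: returns 0 on success, -1 on a reserved encoding. */
int l1ss_decode(uint32_t ctl1, uint32_t ctl2, struct l1ss_timing *out)
{
	/* T_POWER_ON scale encodings 0..2; 3 is reserved. */
	static const unsigned int tpower_scale_us[] = { 2, 10, 100 };
	unsigned int ltr_scale = (ctl1 & L1SS_CTL1_LTR_TH_SCALE) >> 29;
	unsigned int tp_scale = ctl2 & L1SS_CTL2_TPOWER_SCALE;

	assert(out != NULL);
	if (ltr_scale > 5 || tp_scale > 2)
		return -1;

	out->enables = ctl1 & L1SS_CTL1_ENABLE_MASK;
	out->cm_restore_us = (ctl1 & L1SS_CTL1_CM_RESTORE_TIME) >> 8;
	/* LTR scale n means units of 32^n ns, i.e. a shift by 5 * n bits. */
	out->ltr_threshold_ns =
		(uint64_t)((ctl1 & L1SS_CTL1_LTR_TH_VALUE) >> 16) << (5 * ltr_scale);
	out->t_power_on_us =
		((ctl2 & L1SS_CTL2_TPOWER_VALUE) >> 3) * tpower_scale_us[tp_scale];
	return 0;
}
```

As a made-up example (not values from this system), ctl1 = 0x4028320f and ctl2 = 0x29 would decode to all four substates enabled, a 50 us common mode restore time, a 40960 ns LTR threshold and a 50 us T_POWER_ON.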