For #2, you mean "S3 suspend"? My normally solid machine is hit-or- miss on coming out of S3 suspend lately but that could be related to me running Linus' master. Should I try out "Dubious Patch #1"? I lose a %age/hr in S3. -Kenny On Wed, 23 Nov 2022, Matthias Kaehlcke wrote: > Hi, > > not sure this is the best thread to reply to, but we are also > observing suspend issues with the same Kioxia NVMe on a platform > with a Qualcomm sc7280 SoC. The system runs a v5.15 downstream > kernel which includes most (post v5.15) ASPM patches from > upstream. > > There are two issues with the Kioxia NVMe: > > 1. Power consumption is high unless a LTR_L1.2_THRESHOLD of 0ns > is configured (related dubious patch: [1]) > > 2. The system often hangs on resume unless a longer delay is > added in the suspend pass. QC engineers say that the NVMe is > taking so much time to settle in L1ss. > > Other NVMe models don't exhibit power or suspend issues on this > platform, except for one model which also needs a shorter > delay during suspend, otherwise the system will hang > occasionally upon resume. > > The second issue could possibly be 'fixed' with a quirk for > the Kioxia NVMe model, though it seems the issue is not seen on > all platforms, apparently the delay is not needed on Kenny's > system. > > I'm currently a bit at a loss with the first issue. The patch > mentioned above claims that the (no-)snoop latencies are > involved, which may or may not be true. In this thread I saw > Kenny posting 'lspci' output from his (now) working system. > I noticed max (no-)snoop values of 3145728ns, which seems to > be some sort of default (programmed) max. On my system these > values are 0ns, which is the default value of the registers. > I tried to set these to 3146us from the kernel to see if that > makes a difference, but could only successfully update the max > snoop latency, but not non-snoop (maybe this can be only done > at early initialization time?). With just the max snoop latency > set to 3146us power consumption of the NVMe remains high. > > The output of lspci from my system is attached. > > In this thread it was mentioned that possibly a BIOS update > fixed the issue Kenny was seeing. What kind of values is > the BIOS supposed to adjust (I'm a PCI n00b)? > > Any suggestions about what else to try? > > Thanks > > m. > > [1] https://patchwork.kernel.org/project/linux-arm-msm/patch/1663315719-21563-1-git-send-email-quic_krichai@xxxxxxxxxxx/ > > --- > > 0001:00:00.0 PCI bridge: Qualcomm Device 010b (prog-if 00 [Normal decode]) > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0 > Interrupt: pin A routed to IRQ 183 > IOMMU group: 0 > Region 0: Memory at 40700000 (32-bit, non-prefetchable) [size=4K] > Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0 > I/O behind bridge: 00001000-00001fff [size=4K] > Memory behind bridge: 40300000-404fffff [size=2M] > Prefetchable memory behind bridge: 0000000040500000-00000000406fffff [size=2M] > Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- > BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B- > PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- > Capabilities: [40] Power Management version 3 > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [50] MSI: Enable+ Count=1/32 Maskable+ 64bit+ > Address: 00000000fffff000 Data: 0000 > Masking: fffffffe Pending: 00000000 > Capabilities: [70] Express (v2) Root Port (Slot+), MSI 00 > DevCap: MaxPayload 128 bytes, PhantFunc 0 > ExtTag- RBE+ > DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ > MaxPayload 128 bytes, MaxReadReq 512 bytes > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend- > LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us > ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+ > LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes, Disabled- CommClk+ > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > LnkSta: Speed 8GT/s (ok), Width x2 (ok) > TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt- > SltCap: AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ Surprise+ > Slot #0, PowerLimit 0.000W; Interlock+ NoCompl- > SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- > Control: AttnInd Off, PwrInd Off, Power- Interlock- > SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock- > Changed: MRL- PresDet- LinkState- > RootCap: CRSVisible- > RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible- > RootSta: PME ReqID 0000, PMEStatus- PMEPending- > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP+ LTR+ > 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix- > EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- > FRS- LN System CLS Not Supported, TPHComp+ ExtTPHComp- ARIFwd- > AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS- > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, ARIFwd- > AtomicOpsCtl: ReqEn- EgressBlck- > LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS- > LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- > Compliance De-emphasis: -6dB > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+ > EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- > Retimer- 2Retimers- CrosslinkRes: unsupported > Capabilities: [100 v2] Advanced Error Reporting > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ > AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- > HeaderLog: 00000000 00000000 00000000 00000000 > RootCmd: CERptEn- NFERptEn- FERptEn- > RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- > FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 > ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000 > Capabilities: [148 v1] Secondary PCI Express > LnkCtl3: LnkEquIntrruptEn- PerformEqu- > LaneErrStat: 0 > Capabilities: [168 v1] Transaction Processing Hints > No steering table available > Capabilities: [1fc v1] L1 PM Substates > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ > PortCommonModeRestoreTime=70us PortTPowerOnTime=0us > L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- > T_CommonMode=70us LTR1.2_Threshold=86016ns > L1SubCtl2: T_PwrOn=10us > Kernel driver in use: pcieport > > 0001:01:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0001 (prog-if 02 [NVM Express]) > Subsystem: KIOXIA Corporation Device 0001 > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- > Latency: 0 > Interrupt: pin A routed to IRQ 182 > IOMMU group: 0 > Region 0: Memory at 40300000 (64-bit, non-prefetchable) [size=16K] > Capabilities: [40] Express (v2) Endpoint, MSI 00 > DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited > ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W > DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- > MaxPayload 128 bytes, MaxReadReq 512 bytes > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- > LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <2us, L1 <32us > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ > LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes, Disabled- CommClk+ > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > LnkSta: Speed 8GT/s (ok), Width x2 (downgraded) > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- > DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+ > 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix- > EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- > FRS- TPHComp- ExtTPHComp- > AtomicOpsCap: 32bit- 64bit- 128bitCAS- > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled, > AtomicOpsCtl: ReqEn- > LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS- > LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- > Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- > Compliance De-emphasis: -6dB > LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+ > EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest- > Retimer- 2Retimers- CrosslinkRes: unsupported > Capabilities: [80] Power Management version 3 > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-) > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [90] MSI: Enable- Count=1/32 Maskable+ 64bit+ > Address: 0000000000000000 Data: 0000 > Masking: 00000000 Pending: 00000000 > Capabilities: [b0] MSI-X: Enable+ Count=32 Masked- > Vector table: BAR=0 offset=00002000 > PBA: BAR=0 offset=00003000 > Capabilities: [100 v2] Advanced Error Reporting > UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- > UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- > CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- > CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ > AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- > MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- > HeaderLog: 00000000 00000000 00000000 00000000 > Capabilities: [150 v1] Virtual Channel > Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 > Arb: Fixed- WRR32- WRR64- WRR128- > Ctrl: ArbSelect=Fixed > Status: InProgress- > VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- > Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- > Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff > Status: NegoPending- InProgress- > Capabilities: [260 v1] Latency Tolerance Reporting > Max snoop latency: 0ns > Max no snoop latency: 0ns > Capabilities: [300 v1] Secondary PCI Express > LnkCtl3: LnkEquIntrruptEn- PerformEqu- > LaneErrStat: 0 > Capabilities: [400 v1] L1 PM Substates > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+ > PortCommonModeRestoreTime=60us PortTPowerOnTime=10us > L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- > T_CommonMode=0us LTR1.2_Threshold=86016ns > L1SubCtl2: T_PwrOn=10us > Kernel driver in use: nvme > > > On Tue, Apr 12, 2022 at 05:50:47PM -0500, Bjorn Helgaas wrote: > > [+cc Ricky for rtsx_pci ASPM behavior, Rajat, Prasad for L1 SS stuff, > > Victor for interest in disabling ASPM during save/restore] > > > > On Wed, Feb 16, 2022 at 06:41:39PM +0530, Vidya Sagar wrote: > > > On 2/16/2022 11:30 AM, Kenneth R. Crudup wrote: > > > > On Wed, 16 Feb 2022, Vidya Sagar wrote: > > > > > > > > > I see that the ASPM-L1 state of Realtek NIC which was in > > > > > disabled state before hibernate got enabled after hibernate. > > > > > > > > That's actually my SD-Card reader; there's a good chance the BIOS > > > > does "something" to it at boot time, as it's possible to boot from > > > > SD-Card on my laptop. > > > > > > > > > This patch doesn't do anything to LnkCtl register which has > > > > > control for ASPM L1 state. > > > > > > > > > Could you please check why ASPM L1 got enabled post hibernation? > > > > > > > > I wouldn't know how to do that; if you're still interested in that > > > > let me know what to do to determine that. > > > > > I would like Bjorn to take a call on it. > > > At this point, there are contradictions in observations. > > > > Remind me what contradictions you see? I know Kenny saw NVMe errors > > on a kernel that included 4257f7e008ea ("PCI/ASPM: Save/restore L1SS > > Capability for suspend/resume") in December 2020 [1], and that he did > > *not* see those errors on 4257f7e008ea in February 2022 [2]. Is that > > what you mean? > > > > > Just to summarize, > > > - The root ports in your laptop don't have support for L1SS > > > - With the same old code base with which the errors were observed plus my > > > patch on top of it, I see that ASPM-L1 state getting enabled for one of the > > > endpoints (Realtek SD-Card reader) after system comes out of hibernation > > > even though ASPM-L1 was disabled before the system enter into hibernation. > > > No errors are reported now. > > > > I assume you refer to [2], where on 4257f7e008ea ("PCI/ASPM: > > Save/restore L1SS Capability for suspend/resume"), Kenny saw ASPM L1 > > disabled before hibernate and enabled afterwards: > > > > --- pre-hibernate > > +++ post-hibernate > > 00:1d.7 PCI bridge [0604]: Intel [8086:34b7] > > Bus: primary=00, secondary=58, subordinate=58 > > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ > > 58:00.0 RTS525A PCI Express Card Reader [10ec:525a] > > - LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk- > > - ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > + LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk- > > + ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt- > > > > Per PCIe r6.0, sec 7.5.3.7, "ASPM L1 must be enabled by software in > > the Upstream component on a Link prior to enabling ASPM L1 in the > > Downstream component on that Link," so this definitely seems broken, > > but wouldn't explain the NVMe issue. > > > > The PCI core (pcie_config_aspm_link()) always enables L1 in the > > upstream component before the downstream one, but 58:00.0 uses the > > rtsx_pci driver, which does a lot of its own ASPM fiddling, so my > > guess is that it's doing something wrong here. > > > > > - With the linux-next top of the tree plus my patch, no change in the ASPM > > > states and no errors also reported. > > > > I don't know which report this refers to. > > > > > This points to BIOS being buggy (both old and new with new one being less > > > problematic) > > > > I agree that a BIOS change between [1] and [2] seems plausible, but I > > don't think we can prove that yet. I'm slightly queasy because while > > Kenny may have updated his BIOS, most people will not have. > > > > I think we should try this patch again with some changes and maybe > > some debug logging: > > > > - I wonder if we should integrate the LTR, L1 SS, and Link Control > > ASPM restore instead of having them spread around through > > pci_restore_ltr_state(), pci_restore_aspm_l1ss_state(), and > > pci_restore_pcie_state(). Maybe a new pci_restore_aspm() that > > would be called from pci_restore_pcie_state()? > > > > - For L1 PM Substates configuration, sec 5.5.4 says that both ports > > must be configured while ASPM L1 is disabled, but I don't think we > > currently guarantee this: we restore all the upstream component > > state first, and we don't know the ASPM state of the downstream > > one. Maybe we need to: > > > > * When restoring upstream component, > > + disable its ASPM > > > > * When restoring downstream component, > > + disable its ASPM > > + restore upstream component's LTR, L1SS > > + restore downstream component's LTR, L1SS > > + restore upstream component's ASPM > > + restore downstream component's ASPM > > > > This seems pretty messy, but seems like what the spec requires. > > > > - Add some pci_dbg() logging of all these save/restore values to > > help debug any issues. > > > > Bjorn > > > > [1] https://lore.kernel.org/r/20201228040513.GA611645@bjorn-Precision-5520 > > [2] https://lore.kernel.org/r/3ca14a7-b726-8430-fe61-a3ac183a1088@xxxxxxxxx > > > > -- Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange County CA