PCI device hot insert is not detected


 



Hi,

I have a system running Ubuntu 22.04.3 LTS with kernel version
5.15.0-89-generic. There are 10 NVMe drives connected to this
system, bound to the "vfio-pci" driver.

I removed one NVMe drive (PCI address 0000:83:00.0). It was unbound
successfully from the "vfio-pci" driver, but the following error
appeared in the syslog:

can't change power state from D0 to D3hot (config space inaccessible)

About 2 minutes 30 seconds later, I re-inserted the same drive into the
same PCI slot, but the drive was not detected.

Kernel log:
Dec 11 23:52:05 node-4 kernel: [183519.565000] pcieport 0000:80:03.2: pciehp: Slot(14): Link Down
Dec 11 23:52:05 node-4 kernel: [183519.565020] vfio-pci 0000:83:00.0: Relaying device request to user (#0)
Dec 11 23:52:05 node-4 kernel: [183519.567467] vfio-pci 0000:83:00.0: vfio_bar_restore: reset recovery - restoring BARs
Dec 11 23:52:06 node-4 kernel: [183519.629302] vfio-pci 0000:83:00.0: can't change power state from D0 to D3hot (config space inaccessible)
Dec 11 23:52:06 node-4 kernel: [183519.639070] pci 0000:83:00.0: Removing from iommu group 41
Dec 11 23:54:39 node-4 kernel: [183672.630191] pcieport 0000:80:03.2: pciehp: Slot(14): Attention button pressed
Dec 11 23:54:39 node-4 kernel: [183672.630195] pcieport 0000:80:03.2: pciehp: Slot(14) Powering on due to button press
Dec 11 23:54:44 node-4 kernel: [183677.671931] pcieport 0000:80:03.2: pciehp: Slot(14): Card present
Dec 11 23:54:46 node-4 kernel: [183679.783922] pcieport 0000:80:03.2: pciehp: Slot(14): No link
Dec 12 00:09:17 node-4 kernel: [184550.808980] pcieport 0000:80:03.2: pciehp: Slot(14): Attention button pressed
Dec 12 00:09:17 node-4 kernel: [184550.808991] pcieport 0000:80:03.2: pciehp: Slot(14) Powering on due to button press
Dec 12 00:09:22 node-4 kernel: [184556.025139] pcieport 0000:80:03.2: pciehp: Slot(14): Card present
Dec 12 00:09:24 node-4 kernel: [184558.189151] pcieport 0000:80:03.2: pciehp: Slot(14): No link
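
In case it helps anyone reproduce the timeline, here is a rough sketch
of how the pciehp events for one slot could be pulled out of a syslog
file. It is only an illustration: the function name pciehp_events is
made up for this mail, and it assumes each kernel message is on a
single unwrapped line in the same format as the entries in this log.

```python
import re

# Matches unwrapped syslog lines such as:
#   ... kernel: [183519.565000] pcieport 0000:80:03.2: pciehp: Slot(14): Link Down
# The colon after "Slot(N)" is optional because some messages omit it.
PCIEHP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+(?P<port>\S+):\s+"
    r"pciehp:\s+Slot\((?P<slot>[^)]+)\):?\s*(?P<msg>.*)"
)


def pciehp_events(log_text, slot):
    """Return (timestamp, port, message) tuples for one slot, in log order.

    Hypothetical helper, not part of any existing tool. Lines from other
    drivers (vfio-pci, pci) are skipped.
    """
    events = []
    for line in log_text.splitlines():
        m = PCIEHP_RE.search(line)
        if m and m.group("slot") == slot:
            events.append(
                (float(m.group("ts")), m.group("port"), m.group("msg").strip())
            )
    return events
```

Calling pciehp_events(text, "14") on the log text should list the
slot 14 messages in order, which makes the repeated "Card present"
followed by "No link" pattern easier to see.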

lspci output:
 +-[0000:80]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Milan IOMMU
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-01.1-[81]--+-00.0  Mellanox Technologies MT28908 Family [ConnectX-6]
 |           |            \-00.1  Mellanox Technologies MT28908 Family [ConnectX-6]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.1-[82]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           +-03.2-[83]--
 |           +-03.3-[84]--
 |           +-03.4-[85]--
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-07.1-[86]--+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
 |           |            \-00.2  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
 |           +-08.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           \-08.1-[87]--+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
 |                        \-00.2  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PTDMA
 +-[0000:40]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex

Because the link on slot 14 stayed down, the slot's power value in
sysfs remained 0 even though the drive was physically present. Writing
1 to the power file did not work either.

admin@node-4:/sys/bus/pci/slots/14$ cat address
0000:83:00
admin@node-4:~$
admin@node-4:/sys/bus/pci/slots/14$ cat power
0
admin@node-4:/sys/bus/pci/slots/14$
admin@node-4:/sys/bus/pci/slots/14$ sudo echo 1 > power
echo: write error: Operation not permitted
admin@node-4:/sys/bus/pci/slots/14$

Dec 12 09:18:09 node-4 kernel: [217484.101870] pcieport 0000:80:03.2: pciehp: Slot(14): Card present
Dec 12 09:18:12 node-4 kernel: [217486.272077] pcieport 0000:80:03.2: pciehp: Slot(14): No link
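
For anyone who wants to collect the same slot state on their own
system, the sysfs reads from the session can be sketched roughly as
follows. This is only a sketch: read_slot is a made-up helper name, and
only the "address" and "power" files are confirmed by my session. The
"adapter" file is an assumption about what the hotplug slot directory
exposes, so a missing file is reported as None rather than treated as
an error.

```python
from pathlib import Path

# Attribute files to read from the slot directory. Only "address" and
# "power" appear in the session; "adapter" is an assumption.
SLOT_ATTRS = ("address", "power", "adapter")


def read_slot(slot, root="/sys/bus/pci/slots"):
    """Read the listed attributes for one slot; unreadable files map to None.

    Hypothetical helper for illustration only. It reads state and never
    writes to the power file.
    """
    slot_dir = Path(root) / str(slot)
    state = {}
    for name in SLOT_ATTRS:
        try:
            state[name] = (slot_dir / name).read_text().strip()
        except OSError:
            state[name] = None
    return state
```

On the affected system, read_slot(14) would be expected to return the
same address and power values shown in the session.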


After a system reboot, the drive was detected successfully. Can I get
some insight into why the drive was not detected before the reboot?

Here are some system details:

lspci tree:
admin@node-4:~$ sudo lspci -t -vvv
 +-[0000:80]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
 |           +-00.2  Advanced Micro Devices, Inc. [AMD] Milan IOMMU
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-01.1-[81]--+-00.0  Mellanox Technologies MT28908 Family [ConnectX-6]
 |           |            \-00.1  Mellanox Technologies MT28908 Family [ConnectX-6]
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.1-[82]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           +-03.2-[83]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO

lspci output of the said drive:
83:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd General DC NVMe PM9A3
        Physical Slot: 14
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 469
        NUMA node: 2
        IOMMU group: 41
        Region 0: Memory at f2410000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at f2400000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+
FLReset+ SlotPowerLimit 75.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq- OBFF Not
Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink-
Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+
LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: Upstream Port
        Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt-
UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap+
ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [1bc v1] Lane Margining at the Receiver <?>
        Capabilities: [3a0 v1] Data Link Feature <?>
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

lspci output of the PCI root port (80:03.2):
80:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD]
Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 41
        NUMA node: 2
        IOMMU group: 29
        Bus: primary=80, secondary=83, subordinate=83, sec-latency=0
        I/O behind bridge: 0000f000-00000fff [disabled]
        Memory behind bridge: f2400000-f24fffff [size=1M]
        Prefetchable memory behind bridge:
00000180a0400000-00000180a05fffff [size=2M]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16+ MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq-
AuxPwr- TransPend-
                LnkCap: Port #2, Speed 16GT/s, Width x4, ASPM L1, Exit
Latency L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn+ PwrCtrl+ MRL- AttnInd+ PwrInd+
HotPlug+ Surprise+
                        Slot #14, PowerLimit 75.000W; Interlock+ NoCompl-
                SltCtl: Enable: AttnBtn+ PwrFlt- MRL- PresDet-
CmdCplt+ HPIrq+ LinkChg+
                        Control: AttnInd Off, PwrInd On, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt-
PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState-
                RootCap: CRSVisible+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal-
PMEIntEna+ CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not
Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
                         EmergencyPowerReduction Not Supported,
EmergencyPowerReductionInit-
                         FRS- LN System CLS Not Supported, TPHComp+
ExtTPHComp- ARIFwd+
                         AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms,
TimeoutDis- LTR+ OBFF Disabled, ARIFwd+
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink-
Retimer+ 2Retimers+ DRS-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB,
EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+
LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00000  Data: 0000
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc.
[AMD] Starship/Matisse GPP Bridge
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001
Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO+ CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt-
UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol+
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout-
AdvNonFatalErr-
                AERCap: First Error Pointer: 00, ECRCGenCap+
ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn- NFERptEn- FERptEn-
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
        Capabilities: [270 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+
UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+
UpstreamFwd+ EgressCtrl- DirectTrans-
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2-
ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [380 v1] Downstream Port Containment
                DpcCap: INT Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP
PIO Log 6, DL_ActiveErr+
                DpcCtl: Trigger:0 Cmpl- INT- ErrCor- PoisonedTLP-
SwTrigger- DL_ActiveErr-
                DpcSta: Trigger- Reason:00 INT- RPBusy- TriggerExt:00
RP PIO ErrPtr:1f
                Source: 0000
        Capabilities: [400 v1] Data Link Feature <?>
        Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [440 v1] Lane Margining at the Receiver <?>
        Kernel driver in use: pcieport

admin@node-4:~$ sudo inxi -c 42
CPU: 64-core AMD EPYC 7713P (-MCP-) speed/min/max: 2000/1500/3721 MHz
Kernel: 5.15.0-89-generic x86_64 Up: 12m Mem: 565419.6/1019777.2 MiB (55.4%)

Regards,
Ashutosh



