Re: Write to sriov_numvfs triggers kernel panic

On Sat, May 07, 2022 at 10:22:32AM +0000, Volodymyr Babchuk wrote:
> Bjorn Helgaas <helgaas@xxxxxxxxxx> writes:
> > On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
> >> 
> >> I have encountered an issue when the PCI code tries to use both fields in
> >> 
> >>         union {
> >> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
> >> 		struct pci_dev		*physfn;	/* VF: related PF */
> >> 	};
> >> 
> >> (which are part of struct pci_dev) at the same time.
> >> 
> >> The symptoms are as follows:
> >> 
> >> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
> >> 
> >> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
> >> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
> >> Unable to handle kernel paging request at virtual address 0001000200000010
> >> 
> >> Debugging showed the following:
> >> 
> >> pci_iov_add_virtfn() allocates a new struct pci_dev:
> >> 
> >> 	virtfn = pci_alloc_dev(bus);
> >> 
> >> and sets physfn:
> >> 
> >> 	virtfn->is_virtfn = 1;
> >> 	virtfn->physfn = pci_dev_get(dev);
> >> 
> >> Then we get into sriov_init() via the following call path:
> >> 
> >> pci_device_add(virtfn, virtfn->bus);
> >>   pci_init_capabilities(dev);
> >>     pci_iov_init(dev);
> >>       sriov_init(dev, pos);
> >
> > We called pci_device_add() with the VF.  pci_iov_init() only calls
> > sriov_init() if it finds an SR-IOV capability on the device:
> >
> >   pci_iov_init(struct pci_dev *dev)
> >     pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
> >     if (pos)
> >       return sriov_init(dev, pos);
> >
> > So this means the VF must have an SR-IOV capability, which sounds a
> > little dubious.  From PCIe r6.0:
> 
> [...]
> 
> Yes, I dug into debugging and came to the same conclusion. I'm still
> investigating this, but it looks like my PCIe controller (DesignWare-based)
> reads the VF's configuration space incorrectly: instead of providing
> access to the VF's config space, it reads the PF's.
> 
> > Can you supply the output of "sudo lspci -vv" for your system?
> 
> Sure:
> 
> root@spider:~# lspci -vv
> 00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 189
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: [disabled]
>         Memory behind bridge: 30000000-301fffff [size=2M]
>         Prefetchable memory behind bridge: [disabled]
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>         BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
>                 Address: 0000000004030040  Data: 0000
>                 Masking: fffffffe  Pending: 00000000
>         Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0
>                         ExtTag+ RBE+
>                 DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
>                         ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (ok), Width x2 (ok)
>                         TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
>                 RootCap: CRSVisible-
>                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>                 DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
>                          AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
>                          AtomicOpsCtl: ReqEn- EgressBlck-
>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>                 RootCmd: CERptEn- NFERptEn- FERptEn-
>                 RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
>                          FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
>                 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
>         Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
>         Capabilities: [158 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [19c v1] Lane Margining at the Receiver <?>
>         Capabilities: [1bc v1] L1 PM Substates
>                 L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
>                           PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
>                 L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
>                            T_CommonMode=0us LTR1.2_Threshold=0ns
>                 L1SubCtl2: T_PwrOn=10us
>         Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
>         Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
>         Capabilities: [304 v1] Data Link Feature <?>
>         Capabilities: [310 v1] Precision Time Measurement
>                 PTMCap: Requester:+ Responder:+ Root:+
>                 PTMClockGranularity: 16ns
>                 PTMControl: Enabled:- RootSelected:-
>                 PTMEffectiveGranularity: Unknown
>         Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
>         Kernel driver in use: pcieport
>         Kernel modules: pci_endpoint_test
> 
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
>         Subsystem: Samsung Electronics Co Ltd Device a809
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 0
>         NUMA node: 0
>         Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
>         Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [70] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>                 DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
>                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
>                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
>                         TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
>                          10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
>                          EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
>                          FRS-, TPHComp-, ExtTPHComp-
>                          AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>                          AtomicOpsCtl: ReqEn-
>                 LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
>                 Vector table: BAR=0 offset=00004000
>                 PBA: BAR=0 offset=00003000
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>                 AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>                         MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
>                 HeaderLog: 00000000 00000000 00000000 00000000
>         Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
>         Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [178 v1] Secondary PCI Express
>                 LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
>                 LaneErrStat: 0
>         Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
>         Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
>         Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>                 IOVSta: Migration-
>                 Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
>                 VF offset: 2, stride: 1, Device ID: a824
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [3a4 v1] Data Link Feature <?>
>         Kernel driver in use: nvme
>         Kernel modules: nvme

I guess this is before enabling SR-IOV on 01:00.0, so it doesn't show
the VFs themselves.

> > It could be that the device has an SR-IOV capability when it
> > shouldn't.  But even if it does, Linux could tolerate that better
> > than it does today.
> 
> Agreed. I can create a simple patch that checks for is_virtfn
> in sriov_init(). But what should it do if it is set?

Maybe something like this?  It makes no sense to me that a VF would
have an SR-IOV capability, but ...

If the patch below avoids the problem, maybe collect another "lspci -vv"
output including the VF(s).

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 952217572113..9c5184384a45 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -901,6 +901,10 @@ int pci_iov_init(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
+	/* Some devices include SR-IOV cap on VFs as well as PFs */
+	if (dev->is_virtfn)
+		return -ENODEV;
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
 	if (pos)
 		return sriov_init(dev, pos);
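
For illustration only, here is a minimal userspace sketch of why the
check matters. It is not kernel code: the fake_* names, the "has_cap"
flag (standing in for a nonzero pci_find_ext_capability() result) and
the "guard" switch are all made up for this example. It assumes only
what the thread describes: sriov and physfn share storage in a union,
so an SR-IOV init on a VF overwrites the VF's physfn pointer.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for struct pci_sriov and struct pci_dev. */
struct fake_sriov {
	int total_vfs;
};

struct fake_dev {
	unsigned int is_virtfn:1;
	union {
		struct fake_sriov *sriov;	/* PF: SR-IOV info */
		struct fake_dev *physfn;	/* VF: related PF */
	};
};

/* Models sriov_init(): stores the SR-IOV state into the union. */
int fake_sriov_init(struct fake_dev *dev, struct fake_sriov *iov)
{
	dev->sriov = iov;
	return 0;
}

/*
 * Models pci_iov_init(). "has_cap" stands in for a nonzero result from
 * pci_find_ext_capability(); "guard" selects whether the is_virtfn
 * check is applied.
 */
int fake_iov_init(struct fake_dev *dev, int has_cap,
		  struct fake_sriov *iov, int guard)
{
	if (guard && dev->is_virtfn)
		return -ENODEV;
	if (has_cap)
		return fake_sriov_init(dev, iov);
	return -ENODEV;
}

/* Returns 1 if a VF's physfn still points at its PF after init. */
int physfn_survives(int guard)
{
	struct fake_dev pf = { 0 };
	struct fake_dev vf = { 0 };
	struct fake_sriov iov = { .total_vfs = 32 };

	vf.is_virtfn = 1;
	vf.physfn = &pf;
	assert(vf.physfn == &pf);

	/* The VF (wrongly) appears to have an SR-IOV capability. */
	fake_iov_init(&vf, 1, &iov, guard);

	return vf.physfn == &pf;
}
```

In this sketch physfn_survives(0) returns 0, because the store to
sriov replaces physfn, while physfn_survives(1) returns 1, because the
early return leaves the union untouched.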


