Re: Write to sriov_numvfs triggers kernel panic

Hello Bjorn,

Bjorn Helgaas <helgaas@xxxxxxxxxx> writes:

> [+cc Alex, Leon, Jason]
>
> On Wed, May 04, 2022 at 07:56:01PM +0000, Volodymyr Babchuk wrote:
>> 
>> Hello,
>> 
>> I have encountered an issue where the PCI code tries to use both fields in
>> 
>>         union {
>> 		struct pci_sriov	*sriov;		/* PF: SR-IOV info */
>> 		struct pci_dev		*physfn;	/* VF: related PF */
>> 	};
>> 
>> (which are part of struct pci_dev) at the same time.
>> 
>> The symptoms are as follows:
>> 
>> # echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
>> 
>> pci 0000:01:00.2: reg 0x20c: [mem 0x30018000-0x3001ffff 64bit]
>> pci 0000:01:00.2: VF(n) BAR0 space: [mem 0x30018000-0x30117fff 64bit] (contains BAR0 for 32 VFs)
>>  Unable to handle kernel paging request at virtual address 0001000200000010
>>  Mem abort info:
>>    ESR = 0x96000004
>>    EC = 0x25: DABT (current EL), IL = 32 bits
>>    SET = 0, FnV = 0
>>    EA = 0, S1PTW = 0
>>  Data abort info:
>>    ISV = 0, ISS = 0x00000004
>>    CM = 0, WnR = 0
>>  [0001000200000010] address between user and kernel address ranges
>>  Internal error: Oops: 96000004 [#1] PREEMPT SMP
>> Modules linked in: xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6
>> nf_defrag_ipv4 libcrc32c iptable_filter crct10dif_ce nvme nvme_core at24
>> pci_endpoint_test bridge pdrv_genirq ip_tables x_tables ipv6
>>  CPU: 3 PID: 287 Comm: sh Not tainted 5.10.41-lorc+ #233
>>  Hardware name: XENVM-4.17 (DT)
>>  pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>>  pc : pcie_aspm_get_link+0x90/0xcc
>>  lr : pcie_aspm_get_link+0x8c/0xcc
>>  sp : ffff8000130d39c0
>>  x29: ffff8000130d39c0 x28: 00000000000001a4
>>  x27: 00000000ffffee4b x26: ffff80001164f560
>>  x25: 0000000000000000 x24: 0000000000000000
>>  x23: ffff80001164f660 x22: 0000000000000000
>>  x21: ffff000003f08000 x20: ffff800010db37d8
>>  x19: ffff000004b8e780 x18: ffffffffffffffff
>>  x17: 0000000000000000 x16: 00000000deadbeef
>>  x15: ffff8000930d36c7 x14: 0000000000000006
>>  x13: ffff8000115c2710 x12: 000000000000081c
>>  x11: 00000000000002b4 x10: ffff8000115c2710
>>  x9 : ffff8000115c2710 x8 : 00000000ffffefff
>>  x7 : ffff80001161a710 x6 : ffff80001161a710
>>  x5 : ffff00003fdad900 x4 : 0000000000000000
>>  x3 : 0000000000000000 x2 : 0000000000000000
>>  x1 : ffff000003c51c80 x0 : 0001000200000000
>>  Call trace:
>>   pcie_aspm_get_link+0x90/0xcc
>>   aspm_ctrl_attrs_are_visible+0x30/0xc0
>>   internal_create_group+0xd0/0x3cc
>>   internal_create_groups.part.0+0x4c/0xc0
>>   sysfs_create_groups+0x20/0x34
>>   device_add+0x2b4/0x760
>>   pci_device_add+0x814/0x854
>>   pci_iov_add_virtfn+0x240/0x2f0
>>   sriov_enable+0x1f8/0x474
>>   pci_sriov_configure_simple+0x38/0x90
>>   sriov_numvfs_store+0xa4/0x1a0
>>   dev_attr_store+0x1c/0x30
>>   sysfs_kf_write+0x48/0x60
>>   kernfs_fop_write_iter+0x118/0x1ac
>>   new_sync_write+0xe8/0x184
>>   vfs_write+0x23c/0x2a0
>>   ksys_write+0x68/0xf4
>>   __arm64_sys_write+0x20/0x2c
>>   el0_svc_common.constprop.0+0x78/0x1a0
>>   do_el0_svc+0x28/0x94
>>   el0_svc+0x14/0x20
>>   el0_sync_handler+0xa4/0x130
>>   el0_sync+0x180/0x1c0
>> Code: d0002120 9133e000 97ffef8e f9400a60 (f9400813)
>> 
>> 
>> Debugging showed the following:
>> 
>> pci_iov_add_virtfn() allocates new struct pci_dev:
>> 
>> 	virtfn = pci_alloc_dev(bus);
>> and sets physfn:
>> 	virtfn->is_virtfn = 1;
>> 	virtfn->physfn = pci_dev_get(dev);
>> 
>> then we will get into sriov_init() via the following call path:
>> 
>> pci_device_add(virtfn, virtfn->bus);
>>   pci_init_capabilities(dev);
>>     pci_iov_init(dev);
>>       sriov_init(dev, pos);
>
> We called pci_device_add() with the VF.  pci_iov_init() only calls
> sriov_init() if it finds an SR-IOV capability on the device:
>
>   pci_iov_init(struct pci_dev *dev)
>     pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
>     if (pos)
>       return sriov_init(dev, pos);
>
> So this means the VF must have an SR-IOV capability, which sounds a
> little dubious.  From PCIe r6.0:

[...]

Yes, I dug into this and came to the same conclusion. I'm still
investigating, but it looks like my PCIe controller (DesignWare-based)
reads the VF's configuration space incorrectly: instead of providing
access to the VF's config space, it returns the PF's.

>
> Can you supply the output of "sudo lspci -vv" for your system?

Sure:

root@spider:~# lspci -vv
00:00.0 PCI bridge: Renesas Technology Corp. Device 0031 (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 189
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        I/O behind bridge: [disabled]
        Memory behind bridge: 30000000-301fffff [size=2M]
        Prefetchable memory behind bridge: [disabled]
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=128/128 Maskable+ 64bit+
                Address: 0000000004030040  Data: 0000
                Masking: fffffffe  Pending: 00000000
        Capabilities: [70] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (ok), Width x2 (ok)
                        TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
                RootCap: CRSVisible-
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP+, LTR+
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, LN System CLS Not Supported, TPHComp-, ExtTPHComp-, ARIFwd-
                         AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
                         AtomicOpsCtl: ReqEn- EgressBlck-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
                RootCmd: CERptEn- NFERptEn- FERptEn-
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [19c v1] Lane Margining at the Receiver <?>
        Capabilities: [1bc v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [1cc v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
        Capabilities: [2cc v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [304 v1] Data Link Feature <?>
        Capabilities: [310 v1] Precision Time Measurement
                PTMCap: Requester:+ Responder:+ Root:+
                PTMClockGranularity: 16ns
                PTMControl: Enabled:- RootSelected:-
                PTMEffectiveGranularity: Unknown
        Capabilities: [31c v1] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
        Kernel driver in use: pcieport
        Kernel modules: pci_endpoint_test

01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a809
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 0
        NUMA node: 0
        Region 0: Memory at 30010000 (64-bit, non-prefetchable) [size=32K]
        Expansion ROM at 30000000 [virtual] [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM not supported
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=0 offset=00004000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number d3-42-50-11-99-38-25-00
        Capabilities: [168 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [178 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [198 v1] Physical Layer 16.0 GT/s <?>
        Capabilities: [1c0 v1] Lane Margining at the Receiver <?>
        Capabilities: [1e8 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 2, stride: 1, Device ID: a824
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000000030018000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [3a4 v1] Data Link Feature <?>
        Kernel driver in use: nvme
        Kernel modules: nvme
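
The "VF offset: 2, stride: 1" values above are consistent with the first
VF showing up as 0000:01:00.2 in the log. A minimal sketch of the Routing
ID arithmetic (VF n is at PF Routing ID + offset + n * stride); the helper
name vf_routing_id() is hypothetical and only illustrates the calculation:

```c
#include <stdint.h>

/*
 * Routing ID (bus << 8 | devfn) of VF number vf_id (0-based), given the
 * PF's Routing ID and the First VF Offset / VF Stride fields from the
 * SR-IOV capability. Hypothetical helper, not a kernel function.
 */
uint16_t vf_routing_id(uint16_t pf_rid, uint16_t offset, uint16_t stride,
		       unsigned int vf_id)
{
	return (uint16_t)(pf_rid + offset + stride * vf_id);
}
```

For the PF at 01:00.0 (Routing ID 0x0100), vf_routing_id(0x0100, 2, 1, 0)
gives 0x0102, i.e. bus 01, devfn 0x02.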


> It could be that the device has an SR-IOV capability when it
> shouldn't.  But even if it does, Linux could tolerate that better
> than it does today.
>

Agreed. I can create a simple patch that checks is_virtfn in
sriov_init(). But what should it do if the flag is set?
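
A minimal sketch of such a check, modeled in user space with a stand-in
struct rather than the real kernel types. The function name
sriov_init_guarded() is hypothetical, and returning -ENODEV is only an
assumption, since the right policy is exactly the open question:

```c
#include <errno.h>

/* Stand-in for the kernel's struct pci_dev; only the two flags matter. */
struct pci_dev {
	unsigned int is_physfn:1;
	unsigned int is_virtfn:1;
};

/*
 * Hypothetical guarded variant of sriov_init(): refuse to set up PF
 * state on a device that was already marked as a VF, so the
 * sriov/physfn union is never written twice.
 */
int sriov_init_guarded(struct pci_dev *dev, int pos)
{
	(void)pos;			/* capability offset, unused in this sketch */

	if (dev->is_virtfn)
		return -ENODEV;		/* assumed policy: treat as "no SR-IOV" */

	dev->is_physfn = 1;		/* the real code also sets dev->sriov here */
	return 0;
}
```

With this guard, a VF that exposes an SR-IOV capability keeps its physfn
pointer and is never marked as a PF.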

>> sriov_init() overwrites value in the union:
>> 	dev->sriov = iov; <<<<<---- There
>> 	dev->is_physfn = 1;
>> 
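
To make the collision concrete, here is a small user-space sketch (not
kernel code; the structs are simplified stand-ins and the function name
is hypothetical) that repeats the two pairs of assignments on the same
object:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields needed to show the aliasing are kept). */
struct pci_sriov {
	int pos;
};

struct pci_dev {
	union {
		struct pci_sriov *sriov;	/* PF: SR-IOV info */
		struct pci_dev *physfn;		/* VF: related PF */
	};
	unsigned int is_physfn:1;
	unsigned int is_virtfn:1;
};

/* Returns 1 if vf.physfn still points at the PF after the sriov_init()
 * assignments, 0 if it was overwritten. */
int physfn_survives_sriov_init(void)
{
	struct pci_dev pf = { .is_physfn = 1 };
	struct pci_dev vf = { .is_virtfn = 0 };
	struct pci_sriov iov = { .pos = 0x1e8 };

	/* pci_iov_add_virtfn(): mark the new device as a VF of pf. */
	vf.is_virtfn = 1;
	vf.physfn = &pf;
	assert(vf.physfn == &pf);

	/* sriov_init(): the same storage is reused for the iov pointer. */
	vf.sriov = &iov;
	vf.is_physfn = 1;

	return vf.physfn == &pf;
}
```

The function returns 0: after the second pair of assignments, any code
that follows physfn from the VF ends up treating a struct pci_sriov as
if it were a struct pci_dev.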

-- 
Volodymyr Babchuk at EPAM


