Re: [RFC] Resizable BARs vs bridges with BARs

Hi Christian,

On Sat, 19 Nov 2022 20:14:15 +0100
Christian König <christian.koenig@xxxxxxx> wrote:
> Am 19.11.22 um 15:07 schrieb Alex Williamson:
> > On Sat, 19 Nov 2022 12:02:55 +0100
> > Christian König <christian.koenig@xxxxxxx> wrote:
> >> Am 19.11.22 um 00:09 schrieb Alex Williamson:  
> >>> I'm trying to get resizable BARs working in a configuration where my
> >>> root bus resources provide plenty of aperture for the BAR:
> >>>
> >>> pci_bus 0000:5d: root bus resource [io  0x8000-0x9fff window]
> >>> pci_bus 0000:5d: root bus resource [mem 0xb8800000-0xc5ffffff window]
> >>> pci_bus 0000:5d: root bus resource [mem 0xb000000000-0xbfffffffff window] <<<
> >>> pci_bus 0000:5d: root bus resource [bus 5d-7f]
> >>>
> >>> But resizing fails with -ENOSPC.  The topology looks like this:
> >>>
> >>>    +-[0000:5d]-+-00.0-[5e-61]----00.0-[5f-61]--+-01.0-[60]----00.0  Intel Corporation DG2 [Arc A380]
> >>>                                                \-04.0-[61]----00.0  Intel Corporation Device 4f92
> >>>
> >>> The BIOS is not fluent in resizable BARs and only programs the root
> >>> port with a small aperture:
> >>>
> >>> 5d:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 07) (prog-if 00 [Normal decode])
> >>>           Bus: primary=5d, secondary=5e, subordinate=61, sec-latency=0
> >>>           I/O behind bridge: 0000f000-00000fff [disabled]
> >>>           Memory behind bridge: b9000000-ba0fffff [size=17M]
> >>>           Prefetchable memory behind bridge: 000000bfe0000000-000000bff07fffff [size=264M]
> >>>           Kernel driver in use: pcieport
> >>>
> >>> The trouble comes on the upstream PCIe switch port:
> >>>
> >>> 5e:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01) (prog-if 00 [Normal decode])  
> >>>      >>>  Region 0: Memory at b010000000 (64-bit, prefetchable)  
> >>>           Bus: primary=5e, secondary=5f, subordinate=61, sec-latency=0
> >>>           I/O behind bridge: 0000f000-00000fff [disabled]
> >>>           Memory behind bridge: b9000000-ba0fffff [size=17M]
> >>>           Prefetchable memory behind bridge: 000000bfe0000000-000000bfefffffff [size=256M]
> >>>           Kernel driver in use: pcieport
> >>>
> >>> Note region 0 of this bridge, which is 64-bit, prefetchable and
> >>> therefore conflicts with the same type for the resizable BAR on the GPU:
> >>>
> >>> 60:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A380] (rev 05) (prog-if 00 [VGA controller])
> >>>           Region 0: Memory at b9000000 (64-bit, non-prefetchable) [disabled] [size=16M]
> >>>           Region 2: Memory at bfe0000000 (64-bit, prefetchable) [disabled] [size=256M]
> >>>           Expansion ROM at <ignored> [disabled]
> >>>           Capabilities: [420 v1] Physical Resizable BAR
> >>>                   BAR 2: current size: 256MB, supported: 256MB 512MB 1GB 2GB 4GB 8GB
> >>>
> >>> It's a shame that the hardware designers didn't mark the upstream port
> >>> BAR as non-prefetchable to avoid it living in the same resource
> >>> aperture as the resizable BAR on the downstream device.  
> >> This is expected. Bridges always have a 32-bit non-prefetchable and a
> >> 64-bit prefetchable BAR. This is part of the PCI(e) spec.  
> > To be clear, the issue is a bridge implementing a 64-bit, prefetchable
> > BAR at config offset 0x10 & 0x14, not the limit/base registers that
> > define the bridge windows for prefetchable and non-prefetchable
> > downstream resources.  
> 
> WHAT? I've never heard of a bridge with this configuration. I don't 
> fully remember the spec, but I'm pretty sure that this isn't something 
> standard.

The Type 1 config space header allows for two standard BARs.

> Can you give me the output of "sudo lspci -vvvv -s $busID" for this device.

5e:00.0 PCI bridge: Intel Corporation Device 4fa1 (rev 01) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 42
	NUMA node: 0
	IOMMU group: 1
	Region 0: Memory at bff0000000 (64-bit, prefetchable) [size=8M]
	Bus: primary=5e, secondary=5f, subordinate=61, sec-latency=0
	I/O behind bridge: 0000f000-00000fff [disabled]
	Memory behind bridge: b9000000-ba0fffff [size=17M]
	Prefetchable memory behind bridge: 000000bfe0000000-000000bfefffffff [size=256M]
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Upstream Port, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (downgraded), Width x8 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP+ LTR+
			 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS+
			 AtomicOpsCap: Routing+ 32bit+ 64bit+ 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: EgressBlck+
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS+
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: Upstream Port
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [148 v1] Power Budgeting <?>
	Capabilities: [158 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [178 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [1a0 v1] Lane Margining at the Receiver <?>
	Capabilities: [1d4 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [1dc v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=14us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [1f8 v1] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
	Capabilities: [2f8 v1] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
	Capabilities: [330 v1] Data Link Feature <?>
	Kernel driver in use: pcieport

> >>> In any case, it's my understanding that our bridge drivers don't generally make use
> >>> of bridge BARs.  I think we can test whether a driver has done a
> >>> pci_request_region() or equivalent by looking for the IORESOURCE_BUSY
> >>> flag, but I also suspect this is potentially racy.  
> >> That sounds like we have a misunderstanding here how those bridges work.
> >> The upstream bridges should include all the resources of the downstream
> >> devices/bridges in their BARs.  
> > Correct, and the issue is that the bridge at 5e:00.0 _consumes_ a
> > portion of the window we need to resize at the root port.
> >
> > Root port:
> > Prefetchable memory behind bridge: 000000bfe0000000-000000bff07fffff [size=264M]
> >
> > Upstream switch port:
> > Region 0: Memory at b010000000 (64-bit, prefetchable)
> > Prefetchable memory behind bridge: 000000bfe0000000-000000bfefffffff [size=256M]
> >
> > It's that Region 0 resource that prevents resizing.  
> 
> Could it be that some of the ACPI tables are broken and because of this 
> we add a fixed resource to this device?

The switch is part of a plug-in card, so I'd not expect ACPI to be
involved.  It's just a standard BAR:

# setpci -s 5e:00.0 BASE_ADDRESS_0
f000000c
# setpci -s 5e:00.0 BASE_ADDRESS_1
000000bf
# setpci -s 5e:00.0 BASE_ADDRESS_0=ffffffff
# setpci -s 5e:00.0 BASE_ADDRESS_1=ffffffff
# setpci -s 5e:00.0 BASE_ADDRESS_0
ff80000c
# setpci -s 5e:00.0 BASE_ADDRESS_1
ffffffff

All this would have worked transparently if they had chosen to
implement a non-prefetchable BAR.

> Otherwise I have a hard time coming up with a way for a bridge to have a 
> BAR in the config space.

It's a standard part of the Type 1 config header.

> >>> The patch below works for me, allowing the new resourceN_resize sysfs
> >>> attribute to resize the root port window within the provided bus
> >>> window.  Is this the right answer?  How can we make it feel less
> >>> sketchy?  Thanks,  
> >> The correct approach is to remove all the drivers (EFI, vesafb etc...)
> >> which are using the PCI(e) devices under the bridge in question. Then
> >> release the resources and puzzle everything back together.
> >>
> >> See amdgpu_device_resize_fb_bar() how to do this correctly.  
> > Resource resizing in pci-sysfs is largely modeled after the amdgpu
> > code, but I don't see any special provisions for handling conflicting
> > resources consumed on intermediate devices.  The driver attached to the
> > upstream switch port is pcieport and removing it doesn't resolve the
> > problem.  The necessary resource on the root port still reports a
> > child.
> >
> > Is amdgpu resizing known to work in cases where the GPU is downstream
> > of a PCIe switch that consumes resources of the same type and the root
> > port aperture needs to be resized?  I suspect it does not.  Thanks,  
> 
> Well we have the possibility to add extra space to bridges on the kernel 
> command line for this.
> 
> This is used for things like hotplug behind bridges with limited address 
> space.

AFAIK, this is only for hotplug slots; my root port is HotPlug-.

I'd also like to make pci=realloc aware of resizable BARs, but it hits
the same problem.  Thanks,

Alex




