On Thu, Feb 25, 2016 at 12:07:50PM +0000, Lorenzo Pieralisi wrote: > On Thu, Feb 25, 2016 at 03:01:19AM +0000, Gabriele Paoloni wrote: > > [...] > > > > I think the relevant spec is the PCI Firmware Spec, r3.0, sec 4.1.2. > > > Note 2 in that section says the address range of an MMCFG region > > > must be reserved by declaring a motherboard resource, i.e., included > > > in the _CRS of a PNP0C02 or similar device. > > > > I had a look a this. So yes the specs says that we should use the > > PNP0C02 device if MCFG is not supported. > > AFAIK, PNP0C02 is a resource reservation mechanism, the spec says that > MCFG regions must be reserved using PNP0C02, even though its > current usage on x86 is a bit unfathomable to me, in particular > in relation to MCFG resources retrieved for hotpluggable bridges (ie > through _CBA, which I think consider the MCFG region as reserved > by default, regardless of PNP0c02): > > see pci_mmcfg_check_reserved() arch/x86/pci/mmconfig-shared.c I don't know how _CBA-related resources would be reserved. I haven't personally worked with any host bridges that supply _CBA, so I don't know whether or how they handled it. I think the spec intent was that the Consumer/Producer bit (Extended Address Space Descriptor, General Flags Bit[0], see ACPI spec sec 6.4.3.5.4) would be used. Resources such as ECAM areas would be marked "Consumer", meaning the bridge consumed that space itself, and windows passed down to the PCI bus would be marked "Producer". But BIOSes didn't use that bit consistently, so we couldn't rely on it. I verified experimentally that Windows didn't pay attention to that bit either, at least for DWord descriptors: https://bugzilla.kernel.org/show_bug.cgi?id=15701 It's conceivable that we could still use that bit in Extended Address Space descriptors, or maybe some hack like pay attention if the bridge has _CBA, or some such. Or maybe a BIOS could add a PNP0C02 device along with the PNP0A03 device that uses _CBA, with the PNP0C02 _CRS describing the ECAM area referenced by _CBA. Seeems hacky no matter how we slice it. > Have a look at drivers/pnp/system.c for PNP0c02 > > > So probably I can use acpi_get_devices("PNP0C02",...) to retrieve it > > from the quirk match function, I will look into this... > > > > > > > > > On the other side, since this is an exception only for the config > > > > space address of our host controller (as said before all the buses > > > > below the root one support ECAM), I think that it is right to put > > > > this address as a device specific data (in fact the rest of the > > > > config space addresses will be parsed from MCFG). > > > > > > A kernel with no support for your host controller (and thus no > > > knowledge of its _DSD) should still be able to operate the rest of the > > > system correctly. That means we must have a generic way to learn what > > > address space is consumed by the host controller so we don't try to > > > assign it to other devices. > > > > This is something I don't understand much... > > Are you talking about a scenario where we have a Kernel image compiled > > without our host controller support and running on our platform? > > I *think* the point here is that your host controller config space should be > reserved through PNP0c02 so that the kernel will reserve it through the > generic PNP0c02 driver even if your host controller driver (and related > _DSD) is not supported in the kernel. Right. Assume you have two top-level devices: PNP0A03 PCI host bridge _CRS describes windows ???? describes ECAM space consumed PNPxxxx another ACPI device, currently disabled _PRS specifies possible resource settings, may specify no restrictions _SRS assign resources and enable device _CRS empty until device enabled When the OS enables PNPxxxx, it must first assign resources to it using _PRS and _SRS. We evaluate _PRS to find out what the addresses PNPxxxx can support. This tells us things like how wide the address decoder is, the size of the region required, and any alignment restrictions -- basically the same information we get by sizing a PCI BAR. Now, how do we assign space for PNPxxxx? In a few cases, _PRS has only a few specific possibilities, e.g., an x86 legacy serial port that can be at 0x3f8 or 0x2f8. But in general, _PRS need not impose any restrictions. So in general the OS can use any space that can be routed to PNPxxxx. If there's an upstream bridge, it may define windows that restrict the possibilities. But in this case, there *is* no upstream bridge, so the possible choices are the entire physical address space of the platform, except for other things that are already allocated: RAM, the _CRS settings for other ACPI devices, things reserved by the E820 table (at least on x86), etc. If PNP0A03 consumes address space for ECAM, that space must be reported *somewhere* so the OS knows not to place PNPxxxx there. This reporting must be generic (not device-specific like _DSD). The ACPI core (not drivers) is responsible for managing this address space because: a) the OS is not guaranteed to have drivers for all devices, and b) even it *did* have drivers for all devices, the PNPxxxx space may be assigned before drivers are initialized. > I do not understand how PNP0c02 works, currently, by the way. > > If I read x86 code correctly, the unassigned PCI bus resources are > assigned in arch/x86/pci/i386.c (?) fs_initcall(pcibios_assign_resources), > with a comment: > > /** > * called in fs_initcall (one below subsys_initcall), > * give a chance for motherboard reserve resources > */ > > Problem is, motherboard resources are requested through (?): > > drivers/pnp/system.c > > which is also initialized at fs_initcall, so it might be called after > core x86 code reassign resources, defeating the purpose PNP0c02 was > designed for, namely, request motherboard regions before resources > are assigned, am I wrong ? I think you're right. This is a long-standing screwup in Linux. IMHO, ACPI resources should be parsed and reserved by the ACPI core, before any PCI resource management (since PCI host bridges are represented in ACPI). But historically PCI devices have enumerated before ACPI got involved. And the ACPI core doesn't really pay attention to _CRS for most devices (with the exception of PNP0C02). IMO the PNP0C02 code in drivers/pnp/system.c should really be done in the ACPI core for all ACPI devices, similar to the way the PCI core reserves BAR space for all PCI devices, even if we don't have drivers for them. I've tried to fix this in the past, but it is really a nightmare to unravel everything. Because the ACPI core doesn't reserve resources for the _CRS of all ACPI devices, we're already vulnerable to the problem of placing a device on top of another ACPI device. We don't see problems because on x86, at least, most ACPI devices are already configured by the BIOS to be enabled and non-overlapping. But x86 has the advantage of having extensive test coverage courtesy of Windows, and as long as _CRS has the right stuff in it, we at least have the potential of fixing problems in Linux. If the platform doesn't report resource usage correctly on ARM, we may not find problems (because we don't have the Windows test suite) and if we have resource assignment problems because _CRS is lacking, we'll have no way to fix them. > As per last Tomasz's patchset, we claim and assign unassigned PCI > resources upon ACPI PCI host bridge probing (which happens at > subsys_initcall time, courtesy of ACPI current code); at that time the > kernel did not even register the PNP0c02 driver (drivers/pnp/system.c) > (it does that at fs_initcall). On the other hand, we insert MCFG > regions into the resource tree upon MCFG parsing, so I do not > see why we need to rely on PNP0c02 to do that for us (granted, the > mechanism is part of the PCI fw specs, which are x86 centric anyway > ie we can't certainly rely on Int15 e820 to detect reserved memory > on ARM :D) > > There is lots of legacy x86 here and Bjorn definitely has more > visibility into that than I have, the ARM world must understand > how this works to make sure we have an agreement. As you say, there is lots of unpleasant x86 legacy here. Possibly ARM has a chance to clean this up and do it more sanely; I'm not sure whether it's feasible to reverse the ACPI/PCI init order there or not. Rafael, any thoughts on this whole thing? Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html