Thank you for the reply. I have been mulling it over for a while.

On Thu, Jun 27, 2019 at 06:48:35PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-27 at 07:40 +0000, Nicholas Johnson wrote:
> > Unfortunately, the operating system is designed to let the firmware do
> > things. In my mind, ACPI should not need to exist, and the operating
> > system should start with a clean state with PCI and re-enumerate
> > everything at boot time. The PCI allocation is so broken and
> > inconsistent (as you have noted) because it tries to combine the two,
> > when firmware enumeration and native enumeration should be mutually
> > exclusive. I have attempted to re-write large chunks of probe.c, pci.c
> > and setup-bus.c to completely disregard firmware enumeration and clean
> > everything up. Unfortunately, I get stuck in probe.c with the double
> > recursive loop which assigns bus numbers - I cannot figure out how to
> > re-write it successfully. Plus, I feel like nobody will be ready for
> > such a drastic change - I am having trouble selling minor changes that
> > fix actual use cases, as opposed to code reworking.
>
> Well... so a lot of platforms are happy to do a full re-assignment,
> though they use the current code today which leads to rather sub
> standard results when it comes to hotplug bridges.
>
> All the embedded platforms today are like that,and all of ARM64 though
> the latter will somewhat change, all DT based ARM64 will probably
> remain that way.
>
> > My next proposal might be a kernel parameter for PCI to set various
> > levels of disregard for firmware
>
> Well, at least ACPI has this _DSM #5 thingy that can tell us that we
> are allowed to disregard firmware for selected bits and pieces
> (hopefully that tends to be whole hierarchies but I don't know how well
> it's used in practice).

I will need to find out more about this - can you suggest any
particularly good resources on learning about ACPI?
>
> > , from none to complete, which can be
> > added to incrementally to do more and more (rather than all in one patch
> > series).
>
> So there are a number of reasons to honor what the firmware did.
>
> First, today (but that's fixable), we suck at setting up reasonable
> space for hotplug by default.

What annoys me more is that the BIOS vendors

a) don't provide means to configure this in the BIOS, and if they do,
they are hidden options which require you to re-flash the BIOS or use
the dumped IFRs and EFI shell to modify the variables

b) Even the few motherboards with the options for Thunderbolt available
without resorting to (a) have it limited to 4096M.

c) Motherboards are still cramming us into the 32-bit address space in
case somebody is still using a 32-bit OS. There is the "above 4G
decoding option" available on most motherboards, but I am not sure if
that completely fixes the issue. Given that Microsoft said you need
Windows 10 to run on the latest hardware, I do not see many people using
32-bit OS on the latest hardware.

d) These options are especially needed because Windows cannot override
anything whatsoever. Not even _OSC like pcie_ports=native on Linux.

>
> But there are more insidious ones. There are platforms where you can't
> move things (typically virtualized platforms with specific hypervisors,
> such as IBM pseries).

I cannot argue with this.

>
> There are platforms where the *runtime* firwmare (SMM or equivalent or
> even ACPI AML bits) will be poking at some system devices and those
> really must not be moved. (In fact there's a theorical problem with
> such devices becoming temporarily inaccessible during BAR sizing today
> but we mostly get lucky).

I think SMM is a nasty back door. Unfortunately the precedent set is
that the firmware makers can do what they want and we are expected to
honour that in the kernel.
In an ideal world, it would default to the OS assigning things and the
firmware vendors getting blamed when things break if they insist on
using runtime firmware.

In my ideal world, motherboards would have the absolute bare minimum in
BIOS to initialise DRAM and the tricky stuff, and then boot a CoreBoot
Linux kernel off a MicroSD slot on the board. This could easily be
updated constantly (for example, to add NVMe support to old boards) and
it would be impossible to brick the motherboard by changing this, as the
SD card could be removed and restored. This would fix the following:
- No longer a need for PCI option ROMs and their security issues
- Open source / free firmware
- Will not need firmware updates to add NVMe boot support
- Allow target OS booted with kexec to assign resources as required
- Set up IOMMU for Thunderbolt (and all DMA ports) at boot time without
  special BIOS updates required
- Etc

I am sure there are problems with what I am saying, but I do find it
frustrating that the industry has the inability to move on from legacy
to the massive extent that it does. When you have an arch, you expect
that the same bytecode will run on the next system with that same arch.
I don't understand why it stops there - I believe two systems of the
same arch should be indistinguishable - without all of the firmware
differences, and I hope to influence this during my career.

>
> There are other "interesting" cases, like EFI giving us the framebuffer
> address to use if we don't have a native driver... which happens to be
> off a PCI BAR somewhere. Now we *could* probably try to special case
> that and detect when we move that BAR but today we'll probably break if
> we move it.

Also fixed by CoreBoot which will have the Linux kernel and all the
drivers - no need for legacy services like this.

>
> x86 historically has other nasty "hidden" devices. There are historical
> cases of devices that break if they move after initial setup, etc...
> Most of these things are ancient but we have to ensure we keep today's
> policy for old platforms at least.

Sometimes I think that we need a fork of Linux. Although that would be
the same as saying "for old systems, support ends on this kernel version
and you are unlikely to need the new features of the latest kernels on
the oldest hardware". They did drop the older x86 recently, I believe.

>
> > This can supercede pci=realloc. The realloc command is so
> > broken because once the system has loaded drivers, it becomes next to
> > impossible to free and reallocate a resource to fit another device in -
> > because it will upset existing devices. The realloc command is only
> > useful in early boot because nothing is yet assigned, so it works.
> > However, the same effect can be achieved by releasing all the resources
> > on the root port before anything happens. I think it was
> > pci_assign_unassigned_resources(), and I did verify this experimentally.
> > This switch could be part of such a new kernel parameter to ignore
> > firmware influence on PCI.
>
> We should see what ACPI gives us in _DSM #5 on x86 these days.. if it's
> meaningful on enough machines we could use that as an indication that a
> given tree can be reallocated.
>
> > I hope that somehow we can transition to ignoring the firmware - because
> > firmware and native enumeration need to be mutually exclusive, and we
> > need native enumeration for PCI hotplug. If anybody has any ideas how, I
> > would love to hear.
>
> We'll probably have to live with an "in-between" forever on x86 and
> maybe arm64, but with some luck, the static devices will only be the
> on-board stuff, and we can go wild below bridges...

The rest was just speculation and thoughts. My real question here is:
What path do we have towards modernisation? We cannot replace the PCI
code to handle everything natively and disregard the firmware for modern
architectures like the emerging RISC-V because that code will screw up
x86.
So do we have to have pci-old and pci-new subsystems which can be
elected by each arch?

>
> BTW: I'd like us to discuss that f2f at Plumbers in a miniconf if
> enough of us can go.

Please explain this as I have no idea what f2f, Plumbers and miniconf
are.

Cheers,
Nicholas

>
> Cheers,
> Ben.
>
>