(+Jiri, +libvir-list)

On Fri, Nov 22, 2019 at 04:58:25PM +0000, Dr. David Alan Gilbert wrote:
> * Laszlo Ersek (lersek@xxxxxxxxxx) wrote:
> > (+Dave, +Eduardo)
> >
> > On 11/22/19 00:00, dann frazier wrote:
> > > On Tue, Nov 19, 2019 at 06:06:15AM +0100, Laszlo Ersek wrote:
> > >> On 11/19/19 01:54, dann frazier wrote:
> > >>> On Fri, Nov 15, 2019 at 11:51:18PM +0100, Laszlo Ersek wrote:
> > >>>> On 11/15/19 19:56, dann frazier wrote:
> > >>>>> Hi,
> > >>>>>   I'm trying to pass through an Nvidia GPU to a q35 KVM guest, but
> > >>>>> UEFI is failing to allocate resources for it. I have no issues if
> > >>>>> I boot w/ a legacy BIOS, and it works fine if I tell the Linux
> > >>>>> guest to do the allocation itself - but I'm looking for a way to
> > >>>>> make this work w/ OVMF by default.
> > >>>>>
> > >>>>> I posted a debug log here:
> > >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/+attachment/5305740/+files/q35-uefidbg.log
> > >>>>>
> > >>>>> Linux guest lspci output is also available for both seabios/OVMF
> > >>>>> boots here:
> > >>>>> https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563
> > >>>>
> > >>>> By default, OVMF exposes a 64-bit MMIO aperture for PCI MMIO BAR
> > >>>> allocation that is 32GB in size. The generic PciBusDxe driver
> > >>>> collects, orders, and assigns / allocates the MMIO BARs, but it can
> > >>>> work only out of the aperture that platform code advertises.
> > >>>>
> > >>>> Your GPU's region 1 is itself 32GB in size. Given that there are
> > >>>> further PCI devices in the system with further 64-bit MMIO BARs,
> > >>>> the default aperture cannot accommodate everything. In such an
> > >>>> event, PciBusDxe avoids assigning the largest BARs (to my
> > >>>> knowledge), in order to conserve as much aperture as possible for
> > >>>> other devices -- hence breaking the fewest possible PCI devices.
> > >>>>
> > >>>> You can control the aperture size from the QEMU command line. You
> > >>>> can also do it from the libvirt domain XML, technically speaking.
> > >>>> The knob is experimental, so no stability or compatibility
> > >>>> guarantees are made. (That's also the reason why it's a bit of a
> > >>>> hack in the libvirt domain XML.)
> > >>>>
> > >>>> The QEMU cmdline option is described in the following edk2 commit
> > >>>> message:
> > >>>>
> > >>>> https://github.com/tianocore/edk2/commit/7e5b1b670c38
> > >>>
> > >>> Hi Laszlo,
> > >>>
> > >>> Thanks for taking the time to describe this in detail! The -fw_cfg
> > >>> option did avoid the problem for me.
> > >>
> > >> Good to hear, thanks.
> > >>
> > >>> I also noticed that the above commit message mentions the existence
> > >>> of a 24GB card as a reasoning behind choosing the 32GB default
> > >>> aperture. From what you say below, I understand that bumping this
> > >>> above 64GB could break hosts w/ <= 37 physical address bits.
> > >>
> > >> Right.
> > >>
> > >>> What would be the downside of bumping the default aperture to, say,
> > >>> 48GB?
> > >>
> > >> The placement of the aperture is not trivial (please see the code
> > >> comments in the linked commit). The base address of the aperture is
> > >> chosen so that the largest BAR that can fit in the aperture may be
> > >> naturally aligned. (BARs are whole powers of two.)
> > >>
> > >> The largest BAR that can fit in a 48 GB aperture is 32 GB. Therefore
> > >> such an aperture would be aligned at 32 GB -- the lowest base address
> > >> (dependent on guest RAM size) would be 32 GB. Meaning that the
> > >> aperture would end at 32 + 48 = 80 GB. That still breaches the 36-bit
> > >> phys address width.
> > >>
> > >> 32 GB is the largest aperture size that can work with 36-bit phys
> > >> address width; that's the aperture that ends at 64 GB exactly.
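
To make the placement arithmetic above concrete, here is a quick sketch
(illustrative shell arithmetic only, not code from the firmware; the 32 GB
base is the lowest one the placement rule allows, ignoring where guest RAM
ends):

    # The largest power-of-two BAR that fits in a 48 GB aperture is 32 GB,
    # so the aperture base must be aligned to 32 GB; the lowest such base
    # per the explanation above is 32 GB itself.
    aperture=$((48 << 30))    # requested aperture size, in bytes
    base=$((32 << 30))        # lowest naturally aligned base: 32 GB
    top=$((base + aperture))  # 32 GB + 48 GB = 80 GB
    limit=$((1 << 36))        # a 36-bit phys address width tops out at 64 GB
    [ "$top" -le "$limit" ] && echo "fits" || echo "does not fit"  # -> does not fit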

> > > Thanks, yeah - now that I read the code comments that is clear (as
> > > clear as it can be w/ my low level of base knowledge). In the commit
> > > you mention Gerd (CC'd) had suggested a heuristic-based approach for
> > > sizing the aperture. When you say "PCPU address width" - is that a
> > > function of the available physical bits?
> >
> > "PCPU address width" is not a "function" of the available physical bits
> > -- it *is* the available physical bits. "PCPU" simply stands for
> > "physical CPU".
> >
> > > IOW, would that approach allow OVMF to automatically grow the
> > > aperture to the max power of two supported by the host CPU?
> >
> > Maybe.
> >
> > The current logic in OVMF works from the guest-physical address space
> > size -- as deduced from multiple factors, such as the 64-bit MMIO
> > aperture size, and others -- towards the guest-CPU (aka VCPU) address
> > width. The VCPU address width is important for a bunch of other
> > purposes in the firmware, so OVMF has to calculate it no matter what.
> >
> > Again, the current logic is to calculate the highest guest-physical
> > address, and then deduce the VCPU address width from that (and then
> > expose it to the rest of the firmware).
> >
> > Your suggestion would require passing the PCPU (physical CPU) address
> > width from QEMU/KVM into the guest, and reversing the direction of the
> > calculation. The PCPU address width would determine the VCPU address
> > width directly, and then the 64-bit PCI MMIO aperture would be
> > calculated from that.
> >
> > However, there are two caveats.
> >
> > (1) The larger your guest-phys address space (as exposed through the
> > VCPU address width to the rest of the firmware), the more guest RAM you
> > need for page tables. Because, just before entering the DXE phase, the
> > firmware builds 1:1 mapping page tables for the entire guest-phys
> > address space. This is necessary e.g. so you can access any PCI MMIO
> > BAR.
> >
> > Now consider that you have a huge beefy virtualization host with say 46
> > phys address bits, and a wimpy guest with say 1.5GB of guest RAM. Do
> > you absolutely want tens of *terabytes* for your 64-bit PCI MMIO
> > aperture? Do you really want to pay for the necessary page tables with
> > that meager guest RAM?
> >
> > (Such machines do exist BTW, for example:
> >
> > http://mid.mail-archive.com/9BD73EA91F8E404F851CF3F519B14AA8036C67B5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > )
> >
> > In other words, you'd need some kind of knob anyway, because otherwise
> > your aperture could grow too *large*.
> >
> > (2) Exposing the PCPU address width to the guest may have nasty
> > consequences at the QEMU/KVM level, regardless of guest firmware. For
> > example, that kind of "guest enlightenment" could interfere with
> > migration.
> >
> > If you boot a guest let's say with 16GB of RAM, and tell it "hey
> > friend, have 40 bits of phys address width!", then you'll have a
> > difficult time migrating that guest to a host with a CPU that only has
> > 36-bit-wide physical addresses -- even if the destination host has
> > plenty of RAM otherwise, such as a full 64GB.
> >
> > There could be other QEMU/KVM / libvirt issues that I'm unaware of
> > (hence the CC to Dave and Eduardo).
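
Just to put a rough number on caveat (1): a back-of-the-envelope sketch,
assuming the firmware has to fall back to 2 MiB pages (with 1 GiB pages the
overhead is much smaller), identity-mapping a 46-bit address space costs
roughly:

    # 4-level paging with 2 MiB leaf pages: one 4 KiB page-directory page
    # per 1 GiB mapped, one 4 KiB PDPT page per 512 GiB, plus one PML4 page
    phys_bits=46
    space=$((1 << phys_bits))                  # bytes to identity-map
    pd_pages=$((space / (1 << 30)))            # 65536 PD pages
    pdpt_pages=$((space / (1 << 39)))          # 128 PDPT pages
    total_kib=$(( (pd_pages + pdpt_pages + 1) * 4 ))
    echo "~$((total_kib / 1024)) MiB of page tables"   # prints ~256 MiB

which is indeed a painful chunk of a 1.5GB guest.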

> host physical address width gets messy.  There are differences as well
> between upstream QEMU behaviour and some downstreams.
> I think the story is that:
>
> a) QEMU default: 40 bits on any host
> b) -cpu blah,host-phys-bits=true to follow the host.
> c) RHEL has host-phys-bits=true by default
>
> As you say, the only real problem with host-phys-bits is migration -
> between say an E3 and an E5 Xeon with different widths. The magic 40 is
> generally wrong as well - I think it came from some ancient AMD, but
> it's the default on QEMU TCG as well.

Yes, and because it affects live migration ability, we have two
constraints:

1) It needs to be exposed in the libvirt domain XML;

2) QEMU and libvirt can't choose a value that works for everybody
   (because neither QEMU nor libvirt knows where the VM might be migrated
   later).

Which is why the BZ below is important:

> I don't think there's a way to set it in libvirt;
> https://bugzilla.redhat.com/show_bug.cgi?id=1578278 is a bz asking for
> that.
>
> IMHO host-phys-bits is actually pretty safe; and makes most sense in a
> lot of cases.

Yeah, it is mostly safe and makes sense, but messy if you try to migrate
to a host with a different size.
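
For anyone following along, option (b) above looks roughly like this on the
QEMU command line (a sketch only; the machine and memory arguments are
placeholders for whatever the guest already uses):

    # (b) follow the host's physical address width:
    qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
        -cpu host,host-phys-bits=true

    # or pin an explicit width via the related phys-bits property, which
    # stays stable across migration as long as every target host has at
    # least that many bits:
    #   -cpu host,host-phys-bits=false,phys-bits=40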

>
> Dave
> >
> > Thanks,
> > Laszlo
> >
> > >
> > > -dann
> > >
> > >>>> For example, to set a 64GB aperture, pass:
> > >>>>
> > >>>>   -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
> > >>>>
> > >>>> The libvirt domain XML syntax is a bit tricky (and it might
> > >>>> "taint" your domain, as it goes outside of the QEMU features that
> > >>>> libvirt directly maps to):
> > >>>>
> > >>>>   <domain
> > >>>>    type='kvm'
> > >>>>    xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> > >>>>     <qemu:commandline>
> > >>>>       <qemu:arg value='-fw_cfg'/>
> > >>>>       <qemu:arg value='opt/ovmf/X-PciMmio64Mb,string=65536'/>
> > >>>>     </qemu:commandline>
> > >>>>   </domain>
> > >>>>
> > >>>> Some notes:
> > >>>>
> > >>>> (1) The "xmlns:qemu" namespace definition attribute in the
> > >>>> <domain> root element is important. You have to add it manually
> > >>>> when you add <qemu:commandline> and <qemu:arg> too. Without the
> > >>>> namespace definition, the latter elements will make no sense, and
> > >>>> libvirt will delete them immediately.
> > >>>>
> > >>>> (2) The above change will grow your guest's physical address space
> > >>>> to more than 64GB. As a consequence, on your *host*, *if* your
> > >>>> physical CPU supports nested paging (called "ept" on Intel and
> > >>>> "npt" on AMD), *then* the CPU will have to support at least 37
> > >>>> physical address bits too, for the guest to work. Otherwise, the
> > >>>> guest will break, hard.
> > >>>>
> > >>>> Here's how to verify (on the host):
> > >>>>
> > >>>> (2a) run "egrep -w 'npt|ept' /proc/cpuinfo" --> if this does not
> > >>>> produce output, then stop reading here; things should work. Your
> > >>>> CPU does not support nested paging, so KVM will use shadow paging,
> > >>>> which is slower, but at least you don't have to care about the
> > >>>> CPU's phys address width.
> > >>>>
> > >>>> (2b) otherwise (i.e. when you do have nested paging), run
> > >>>> "grep 'bits physical' /proc/cpuinfo" --> if the physical address
> > >>>> width is >=37, you're good.
> > >>>>
> > >>>> (2c) if you have nested paging but exactly 36 phys address bits,
> > >>>> then you'll have to forcibly disable nested paging (assuming you
> > >>>> want to run a guest with larger than 64GB guest-phys address
> > >>>> space, that is). On Intel, issue:
> > >>>>
> > >>>>   rmmod kvm_intel
> > >>>>   modprobe kvm_intel ept=N
> > >>>>
> > >>>> On AMD, go with:
> > >>>>
> > >>>>   rmmod kvm_amd
> > >>>>   modprobe kvm_amd npt=N
> > >>>>
> > >>>> Hope this helps,
> > >>>> Laszlo
> > >>>>
> > >>>
> > >>
> > >
> >
> --
> Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK

-- 
Eduardo

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list