On Sat, 13 Feb 2016 18:03:31 -0700
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Sat, 13 Feb 2016 19:20:32 -0500
> "Kevin O'Connor" <kevin@xxxxxxxxxxxx> wrote:
>
> > On Sat, Feb 13, 2016 at 01:57:09PM -0700, Alex Williamson wrote:
> > > On Sat, 13 Feb 2016 15:05:09 -0500
> > > "Kevin O'Connor" <kevin@xxxxxxxxxxxx> wrote:
> > > > On Sat, Feb 13, 2016 at 11:51:51AM -0700, Alex Williamson wrote:
> > > > > On Sat, 13 Feb 2016 13:18:39 -0500
> > > > > "Kevin O'Connor" <kevin@xxxxxxxxxxxx> wrote:
> > > > > > On Sat, Feb 13, 2016 at 08:12:09AM -0700, Alex Williamson wrote:
> > > > > > > On Fri, 12 Feb 2016 21:49:04 -0500
> > > > > > > "Kevin O'Connor" <kevin@xxxxxxxxxxxx> wrote:
> > > > > > > > On Fri, Feb 12, 2016 at 05:23:18PM -0700, Alex Williamson wrote:
> > > > > > > > > Intel IGD makes use of memory allocated and marked reserved by the
> > > > > > > > > BIOS as a stolen memory range. For the most part, guest drivers don't
> > > > > > > > > make use of this, but our achilles heel is the vBIOS. The vBIOS
> > > > > > > > > programs the device to use the host stolen memory range and it's used
> > > > > > > > > in the pre-boot environment. Generally the guest won't have access to
> > > > > > > > > the host stolen memory area, so these accesses either land in VM
> > > > > > > > > memory or unassigned space and generate IOMMU faults. By allocating
> > > > > > > > > this range in SeaBIOS and programming it into the device, QEMU (via
> > > > > > > > > vfio) can make sure this guest allocated stolen memory range is used
> > > > > > > > > instead.
> > > > > > > >
> > > > > > > > What does "vBIOS" mean in this context? Is it the video device option
> > > > > > > > rom or something else?
> > > > > > >
> > > > > > > vBIOS = video BIOS, you're correct, it's just the video device option
> > > > > > > ROM.
> > > > > >
> > > > > > Is the problem from when the host runs the video option rom, or is the
> > > > > > problem from the guest (via SeaBIOS) running the video option rom? If
> > > > > > the guest is running the option rom, is it the first time the option
> > > > > > rom has been run for the machine (ie, was the option rom not executed
> > > > > > on the host when the host machine first booted)?
> > > > > >
> > > > > > FWIW, many of the chromebooks use coreboot with Intel graphics and a
> > > > > > number of users have installed SeaBIOS (running natively) on their
> > > > > > machines. Running the intel video option rom more than once has been
> > > > > > known to cause issues.
> > > > >
> > > > > The issue is in the VM and it occurs every time the option ROM is
> > > > > executed. Standard VGA text mode displays fine (ex. SeaBIOS version
> > > > > string and boot menu), but any sort of extended graphics mode (ex. live
> > > > > CD boot menu) tries to make use of the host memory area which
> > > > > corresponds to the stolen memory area of the physical device. We're
> > > > > not really sure how the ROM execution arrives at these addresses (not
> > > > > from the device according to access traces), but we can see when the
> > > > > ROM is writing these addresses to the device and modify they addresses
> > > > > to point at a VM memory range which we've allocated. That's what this
> > > > > code attempts to do, allocate a buffer and tell QEMU about it via the
> > > > > BDSM (Base Data of Stolen Memory) register.
> > > >
> > > > Forgive me if I'm not fully understanding this. If I read what you're
> > > > saying then the sequence is something like:
> > > >
> > > > 1 - the host system bios (or vgabios) programs the GTT/stolen memory
> > > >     base register at host system bootup time and reserves it in the
> > > >     host e820 map.
> > > >
> > > > 2 - upon running qemu, the guest reruns the vga bios option rom which
> > > >     seems to work (ie, text mode works)
> > > >
> > > > 3 - in the guest, upon running a bootloader that uses graphics mode,
> > > >     the bootloader calls the vgabios to switch to graphics mode, and
> > > >     the vgabios sends commands to the graphics hardware that somehow
> > > >     reference the host stolen memory
> > >
> > > What exactly happens here isn't clear to me, but this is a plausible
> > > explanation. What we see in tracing access to the hardware is that a
> > > bunch of addresses are written to the device that fall within the host
> > > e820 reserved area and then the device starts generating IOMMU faults
> > > trying to access those addresses.
> > >
> > > > 4 - your patch causes QEMU to catch these commands with references to
> > > >     the host stolen memory and replace them with references to the
> > > >     guest stolen memory (which seabios creates)
> > > >
> > > > Am I understanding the above correctly?
> > >
> > > Yes.
> > >
> > > > Is the only reason to run the intel option rom in the guest for
> > > > bootloader graphic mode support? Last time I looked, the intel vga
> > > > hardware could fully emulate a legacy vga device - if the device is in
> > > > vga compatibility mode then it may be possible to have seavgabios
> > > > drive mode changes.
> > >
> > > I have a SandyBridge based laptop (Lenovo W520) where the LCD panel
> > > won't turn on without the vBIOS.
> >
> > This confuses me - why didn't the host system BIOS turn on the LCD
> > panel during host bootup?
>
> It turns off when we reset the device between VM instances or between
> VM boots. IGD supports Function Level Reset (FLR).
>
> > > Another desktop IvyBridge system
> > > doesn't really care about the vBIOS so long as we don't ask it to
> > > output anything before the guest native drivers are loaded.
> > > If we could, I think we'd just enable vBIOS for laptop panel support,
> > > but that's really not an option, it's going to run as a boot option
> > > ROM as well, so we need to fix the issues that it generates there.
> >
> > From my experience with coreboot, running the vga option rom multiple
> > times during a given boot is very fragile. (By multiple times, I mean
> > either the host running it and then a guest, or running it multiple
> > times from multiple guests.) YMMV.
>
> We do this regularly for graphics assignment, Nvidia, AMD, and now
> Intel. It generally works ok. Perhaps you've seen issues with the
> option ROM being run multiple times without resetting the device. I
> could certainly believe that. We only have one blacklisted Broadcom
> ROM in vfio, probably due to missing or incomplete device reset method.
>
> > > > > > [...]
> > > > > > > The write to 0x5C is used by QEMU code that traps writes to the
> > > > > > > device I/O port BAR and replaces the host stolen memory address
> > > > > > > with the new guest address generated here. 0x5C is initialized to
> > > > > > > 0x0 by kernel vfio code, so we can detect whether it has been
> > > > > > > written. If not written, QEMU has no place to redirect to for
> > > > > > > stolen memory and it will either overlap VM memory or an unassigned
> > > > > > > area. The former may corrupt VM memory, the latter throws host
> > > > > > > errors. We could in QEMU halt with a hardware error if 0x5C hasn't
> > > > > > > been programmed.
> > > > > >
> > > > > > So, if I understand correctly, 0x5C is not a "real" register on the
> > > > > > hardware, but is instead just a mechanism to give QEMU the address of
> > > > > > some guest visible ram?
> > > > >
> > > > > It is a real register, BDSM that is virtualized by vfio turning it
> > > > > effectively into a scratch register. On physical hardware the
> > > > > register is read-only.
> > > > >
> > > > > > BTW, is 0xFC a "real" register in the hardware?
> > > > > > How does the guest find the location of the "OpRegion" if
> > > > > > SeaBIOS allocates it?
> > > > >
> > > > > 0xFC is the ASL Storage register, the guest finds the location of the
> > > > > OpRegion using this register. This is another register that is
> > > > > read-only on real hardware but virtualized through vfio so we can
> > > > > relocate the OpRegion into the VM address space.
> > > > >
> > > > > I've found that allocating a dummy MMIO BAR does work as an alternative
> > > > > for mapping space for this stolen memory into the VM address space.
> > > > > For a Linux guest it works to allocate BAR5 on the IGD device.
> > > > > Windows10 is not so happy with this, but does work if I allocate the
> > > > > BAR on something like the ISA bridge device. My guess is that the IGD
> > > > > driver in Windows freaks out at finding this strange new BAR on its
> > > > > device. So I'll need to come up with an algorithm for either creating
> > > > > a dummy PCI device to host this BAR or trying to add it to other
> > > > > existing devices. It's certainly a more self-contained solution this
> > > > > way, so I expect we'll only need patch 1/3 from this series. Thanks,
> > > >
> > > > Okay. (I'm not saying patch 3 is bad, but okay.)
> > > >
> > > > If you go through the trouble of mapping the BDSM through a pci bar,
> > > > then why not do the same with ASLS then too?
> > >
> > > I suppose we could do that. There are a few nuances to the fake BAR
> > > solution:
> > >
> > > 1) The BAR needs to get mapped and not remapped while in use - usually
> > >    not a problem.
> > >
> > > 2) The guest needs to not disable the device we attach the BARs to,
> > >    which it might do if it doesn't recognize the device.
> > >
> > > 3) We need to be careful about adding BARs to devices the guest does
> > >    have drivers for or we might overlap real functionality.
> > >
> > > 4) If we create a dummy device with bogus IDs, it will show up with an
> > >    exclamation mark in device manager, which makes people unhappy.
> > >
> > > So from a perspective of being self contained, the fake BAR solution is
> > > very good, but it's not without issue. I'll try to think of what sort
> > > of dummy device we could create that would always have a guest driver,
> > > but nothing that a couple extra BARs would interfere with. Maybe a
> > > generic PCI bridge. Thanks,
> >
> > Okay. Again, I'm not stating a preferred direction.
> >
> > BTW, I wonder if the recent discussion between Michael and Igor is
> > relevant here:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg05602.html
>
> I'm certainly open to rebuttals against this approach, but I do have it
> working. Being entirely self contained is pretty intriguing.
> Theoretically this would allow us to work with OVMF with no
> modifications. Linux guests enable and disable devices several times
> during boot (per the spec, any time the BAR is sized it should be
> disabled), Windows never seems to disable the device. The LPC/ISA
> bridge seems to be the best place to put these BARs, we need to create
> that anyway for pre-Broadwell/Skylake and the device itself has no
> implemented BARs. The ISA bridge is just a shell device to keep the
> driver happy on those older chips, so squatting on a couple BARs
> doesn't seem too terrible. Thanks,

I'm not sure how Windows would behave wrt rebalancing if that fake BAR
is added to the LPC/ISA bridge, and I can't think of a way to test and
verify it either.

In the discussion linked above, Michael dislikes allocating the GPA in
QEMU, while I dislike stealing guest RAM just to get the same GPA.
Perhaps it would be better if we taught SeaBIOS/OVMF to allocate a GPA
somewhere in free address space, without stealing guest RAM, and to
tell QEMU about it.
Michael suggested doing something similar for stolen RAM in the
bios_linker interface only, which is a part of SeaBIOS/OVMF, but making
it a more generic PV interface might be even more useful. That way we
won't have to implement fake PCI devices/BARs or hijack existing PCI
devices.
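
To make the "allocate a GPA in free address space" idea a bit more
concrete, here is a rough sketch of the firmware-side search only. All
names (struct gpa_range, find_free_gpa) are made up for illustration;
this is not an existing SeaBIOS, OVMF or QEMU interface, and how the
result would be reported back to QEMU is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a guest-physical range already in use
 * (RAM, PCI BARs, ...). Not an existing firmware structure. */
struct gpa_range {
    uint64_t start;
    uint64_t size;
};

/*
 * Return the lowest 'align'-aligned address of a hole of 'size' bytes
 * inside [win_start, win_end) that overlaps none of the 'used' ranges.
 * 'used' must be sorted by start address and 'align' must be a power
 * of two. Returns 0 if no such hole exists.
 */
static uint64_t find_free_gpa(const struct gpa_range *used, size_t n,
                              uint64_t win_start, uint64_t win_end,
                              uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* First candidate: window start rounded up to the alignment. */
    uint64_t cand = (win_start + align - 1) & ~(align - 1);

    for (size_t i = 0; i < n; i++) {
        uint64_t used_end = used[i].start + used[i].size;

        if (used_end <= cand)
            continue;   /* range lies entirely below the candidate */
        if (cand + size <= used[i].start)
            break;      /* candidate fits below this and later ranges */

        /* Overlap: retry just above this range, re-aligned. */
        cand = (used_end + align - 1) & ~(align - 1);
    }

    return (cand + size <= win_end) ? cand : 0;
}
```

The firmware would call something like this with the ranges it has
already handed out, so nothing is carved out of guest RAM, and would
then pass the resulting address to QEMU through whatever PV channel we
settle on.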