On 06/28/16 12:04, Christoffer Dall wrote:
> On Mon, Jun 27, 2016 at 03:57:28PM +0200, Ard Biesheuvel wrote:
>> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm going to ask some stupid questions here...
>>>>>>>
>>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This old subject came up again in a discussion related to PCIe support
>>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>>>>>> regions as cacheable is preventing us from reusing a significant slice
>>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>>>>>>
>>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>>>>>> by the host. Note that this has nothing to do with the device being
>>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>>>>>> guest to assume that accesses to such a region are not visible to the
>>>>>>>> device before they hit the actual PCI MMIO window and are translated
>>>>>>>> into cycles on the PCI bus.
>>>>>>>
>>>>>>> For the sake of completeness, why is this reasonable?
>>>>>>>
>>>>>>
>>>>>> Because the whole point of accessing these regions is to communicate
>>>>>> with the device. It is common to use write combining mappings for
>>>>>> things like framebuffers to group writes before they hit the PCI bus,
>>>>>> but any caching just makes it more difficult for the driver state and
>>>>>> device state to remain synchronized.
>>>>>>
>>>>>>> Is this how any real ARM system implementing PCI would actually work?
>>>>>>>
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>>> That means that mapping such a region
>>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>>>>>> patches implementing this against the generic PCI stack in Tianocore
>>>>>>>> will be accepted by the maintainers.
>>>>>>>>
>>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>>>>>> regions are backed by host memory, is there any way we can already
>>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>>>>>> anything we could do to treat these specially? Perhaps something like
>>>>>>>> using read-only memslots so we can at least trap guest writes instead
>>>>>>>> of having main memory going out of sync with the caches unnoticed? I
>>>>>>>> am just brainstorming here ...
>>>>>>>
>>>>>>> I think the only sensible solution is to make sure that the guest and
>>>>>>> emulation mappings use the same memory type, either cached or
>>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
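
(As a side note, the reason the two views *must* use the same memory type
is the ARMv8 rule for combining the stage 1 and stage 2 attributes: the
more restrictive type wins. The snippet below only illustrates that rule;
the enum ordering and function names are made up for the example, nothing
in it is actual kernel or QEMU code.)

enum mem_type {
        MT_DEVICE    = 0,       /* Device-nGnRE and friends     */
        MT_NORMAL_NC = 1,       /* Normal, Non-cacheable        */
        MT_NORMAL_WT = 2,       /* Normal, Write-Through        */
        MT_NORMAL_WB = 3,       /* Normal, Write-Back (cached)  */
};

/* The effective type of a guest access is the stricter of the two stages. */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
        return s1 < s2 ? s1 : s2;
}

int main(void)
{
        /*
         * The guest maps the BAR as Device (as any PCI driver would),
         * while the host backs it with Normal Write-Back RAM: the
         * guest's accesses stay Device and bypass the caches that
         * QEMU's own cacheable userland mapping goes through.
         */
        combine_s1_s2(MT_DEVICE, MT_NORMAL_WB);    /* -> MT_DEVICE    */

        /*
         * The opposite direction does work: a Non-cacheable stage 2
         * override makes a cacheable guest mapping stricter, which is
         * why forcing *non*-cacheable is on the table while forcing
         * cacheable from stage 2 is not.
         */
        combine_s1_s2(MT_NORMAL_WB, MT_NORMAL_NC); /* -> MT_NORMAL_NC */
        return 0;
}
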
>>>>>>>
>>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is
>>>>>>> one way.
>>>>>>>
>>>>>>> The other way is to use something like what you once wrote that rewrites
>>>>>>> stage-1 mappings to be cacheable; does that apply here?
>>>>>>>
>>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>>>>>
>>>>>>
>>>>>> So first of all, let me reiterate that I could only find a single
>>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>>>>> which is vga-pci.c. I wonder if there are any other occurrences, but
>>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>>>>> backed by host memory rather than spend a lot of effort working around
>>>>>> it.
>>>>>
>>>>> Right, ok. So Marc's point during his KVM Forum talk was basically:
>>>>> don't use the legacy VGA adapter on ARM, use virtio graphics instead,
>>>>> right?
>>>>>
>>>>
>>>> Yes. But nothing is currently preventing you from using that, and I
>>>> think we should prefer crappy performance but correct operation over
>>>> the current situation. So in general, we should either disallow PCI
>>>> BARs backed by host memory, or emulate them, but never back them by a
>>>> RAM memslot when running under ARM/KVM.
>>>
>>> Agreed. I just think that emulating accesses by trapping them is not
>>> just slow, it's not really possible in practice, and even if it is,
>>> it's probably *unusably* slow.
>>>
>>
>> Well, it would probably involve a lot of effort to implement emulation
>> of instructions with multiple output registers, such as ldp/stp and
>> register writeback. And indeed, trapping on each store instruction to
>> the framebuffer is going to be sloooooowwwww.
>>
>> So let's disregard that option for now ...
>>
>>>>
>>>>> What is the proposed solution for someone shipping an ARM server and
>>>>> wishing to provide a graphical output for that server?
>>>>>
>>>>
>>>> The problem does not exist on bare metal. It is an implementation
>>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>>>> the view of the emulator in QEMU.
>>>>
>>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs
>>>>> if that's not a supported real hardware case. However, I don't see what
>>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
>>>>> an ARM server, and people selling ARM servers probably want this to
>>>>> happen, I'm guessing.
>>>>>
>>>>
>>>> As I said, the problem does not exist on bare metal.
>>>>
>>>>>>
>>>>>> If we do decide to fix this, the best way would be to use uncached
>>>>>> attributes for the QEMU userland mapping, and force it uncached in the
>>>>>> guest via a stage 2 override (as Drew suggests). The only problem I
>>>>>> see here is that the host's kernel direct mapping has a cached alias
>>>>>> that we need to get rid of.
>>>>>
>>>>> Do we have a way to accomplish that?
>>>>>
>>>>> Will we run into a bunch of other problems if we begin punching holes in
>>>>> the direct mapping for regular RAM?
>>>>>
>>>>
>>>> I think the policy up until now has been not to remap regions in the
>>>> kernel direct mapping for the purposes of DMA, and I think by the same
>>>> reasoning, it is not preferable for KVM either.
>>>
>>> I guess the difference is that from the (host) kernel's point of view
>>> this is not DMA memory, but just regular RAM. I just don't know enough
>>> about the kernel's VM mappings to know what's involved here, but we
>>> should find out somehow...
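
(To make the "uncached userland mapping plus stage 2 override" option a
little more concrete: very roughly, the userspace side could look like
the sketch below. KVM_MEM_NONCACHED is invented for the sake of the
example -- no such flag exists today -- and only KVM_SET_USER_MEMORY_REGION
and its struct are the real, existing interface. It also does nothing
about the cached linear-map alias discussed above.)

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <string.h>

#define KVM_MEM_NONCACHED (1UL << 2)    /* hypothetical flag */

/* Register the RAM backing a PCI BAR, asking for a Non-cacheable stage 2. */
static int register_bar_backing(int vm_fd, __u32 slot, __u64 gpa,
                                __u64 size, void *hva)
{
        struct kvm_userspace_memory_region region;

        memset(&region, 0, sizeof(region));
        region.slot            = slot;
        region.flags           = KVM_MEM_NONCACHED;  /* would force S2 NC */
        region.guest_phys_addr = gpa;
        region.memory_size     = size;
        region.userspace_addr  = (__u64)(unsigned long)hva;

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}
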
>>>
>>
>> Whether it is DMA memory or not does not make a difference. The point
>> is simply that arm64 maps all RAM owned by the kernel as cacheable,
>> and remapping arbitrary ranges with different attributes is
>> problematic, since it is also likely to involve splitting of regions,
>> which is cumbersome with a mapping that is always live.
>>
>> So instead, we'd have to reserve some system memory early on and
>> remove it from the linear mapping, the complexity of which is more
>> than we are probably prepared to put up with.
>
> Don't we have any existing frameworks for such things, like ion or
> other things like that? Not sure if these systems export anything to
> userspace or even serve the purpose we want, but I thought I'd throw it
> out there.
>
>>
>> So if vga-pci.c is the only problematic device, for which a reasonable
>> alternative exists (virtio-gpu), I think the only feasible solution is
>> to educate QEMU not to allow RAM memslots to be exposed via PCI BARs
>> when running under KVM/ARM.
>
> It would be good if we could support vga-pci under KVM/ARM, but if
> there's no other way than rewriting the arm64 kernel's memory mappings
> completely, then we're probably stuck there, unfortunately.

It's been mentioned earlier that the specific combination of S1 and S2
mappings on aarch64 is actually an *architecture bug*. If we accept that
qualification, then we should recognize that what we are looking for
here is a *workaround*.

In your blog post
<http://www.linaro.org/blog/core-dump/on-the-performance-of-arm-virtualization/>,
you mention VHE ("Virtualization Host Extensions"). That's clearly a
sign of the architecture adapting to the needs of virtualization
software.

Do you see any chance that the S1/S2 attribute combination rules, too,
could be fixed in a new revision of the architecture?

Thanks
Laszlo

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm