> On 27.06.2016 at 15:57, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
>
>> On 27 June 2016 at 15:35, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>> On Mon, Jun 27, 2016 at 02:30:46PM +0200, Ard Biesheuvel wrote:
>>>> On 27 June 2016 at 12:34, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>> On Mon, Jun 27, 2016 at 11:47:18AM +0200, Ard Biesheuvel wrote:
>>>>>> On 27 June 2016 at 11:16, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm going to ask some stupid questions here...
>>>>>>
>>>>>>> On Fri, Jun 24, 2016 at 04:04:45PM +0200, Ard Biesheuvel wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> This old subject came up again in a discussion related to PCIe support
>>>>>>> for QEMU/KVM under Tianocore. The fact that we need to map PCI MMIO
>>>>>>> regions as cacheable is preventing us from reusing a significant slice
>>>>>>> of the PCIe support infrastructure, and so I'd like to bring this up
>>>>>>> again, perhaps just to reiterate why we're simply out of luck.
>>>>>>>
>>>>>>> To refresh your memories, the issue is that on ARM, PCI MMIO regions
>>>>>>> for emulated devices may be backed by memory that is mapped cacheable
>>>>>>> by the host. Note that this has nothing to do with the device being
>>>>>>> DMA coherent or not: in this case, we are dealing with regions that
>>>>>>> are not memory from the POV of the guest, and it is reasonable for the
>>>>>>> guest to assume that accesses to such a region are not visible to the
>>>>>>> device before they hit the actual PCI MMIO window and are translated
>>>>>>> into cycles on the PCI bus.
>>>>>>
>>>>>> For the sake of completeness, why is this reasonable?
>>>>>
>>>>> Because the whole point of accessing these regions is to communicate
>>>>> with the device.
>>>>> It is common to use write combining mappings for
>>>>> things like framebuffers to group writes before they hit the PCI bus,
>>>>> but any caching just makes it more difficult for the driver state and
>>>>> device state to remain synchronized.
>>>>>
>>>>>> Is this how any real ARM system implementing PCI would actually work?
>>>>>
>>>>> Yes.
>>>>>
>>>>>>> That means that mapping such a region
>>>>>>> cacheable is a strange thing to do, in fact, and it is unlikely that
>>>>>>> patches implementing this against the generic PCI stack in Tianocore
>>>>>>> will be accepted by the maintainers.
>>>>>>>
>>>>>>> Note that this issue not only affects framebuffers on PCI cards, it
>>>>>>> also affects emulated USB host controllers (perhaps Alex can remind us
>>>>>>> which one exactly?) and likely other emulated generic PCI devices as
>>>>>>> well.
>>>>>>>
>>>>>>> Since the issue exists only for emulated PCI devices whose MMIO
>>>>>>> regions are backed by host memory, is there any way we can already
>>>>>>> distinguish such memslots from ordinary ones? If we can, is there
>>>>>>> anything we could do to treat these specially? Perhaps something like
>>>>>>> using read-only memslots so we can at least trap guest writes instead
>>>>>>> of having main memory going out of sync with the caches unnoticed? I
>>>>>>> am just brainstorming here ...
>>>>>>
>>>>>> I think the only sensible solution is to make sure that the guest and
>>>>>> emulation mappings use the same memory type, either cached or
>>>>>> non-cached, and we 'simply' have to find the best way to implement this.
>>>>>>
>>>>>> As Drew suggested, forcing some S2 mappings to be non-cacheable is the
>>>>>> one way.
>>>>>>
>>>>>> The other way is to use something like what you once wrote that rewrites
>>>>>> stage-1 mappings to be cacheable, does that apply here ?
>>>>>>
>>>>>> Do we have a clear picture of why we'd prefer one way over the other?
>>>>>
>>>>> So first of all, let me reiterate that I could only find a single
>>>>> instance in QEMU where a PCI MMIO region is backed by host memory,
>>>>> which is vga-pci.c. I wonder if there are any other occurrences, but
>>>>> if there aren't any, it makes much more sense to prohibit PCI BARs
>>>>> backed by host memory rather than spend a lot of effort working around
>>>>> it.
>>>>
>>>> Right, ok. So Marc's point during his KVM Forum talk was basically,
>>>> don't use the legacy VGA adapter on ARM and use virtio graphics, right?
>>>
>>> Yes. But nothing is preventing you currently from using that, and I
>>> think we should prefer crappy performance but correct operation over
>>> the current situation. So in general, we should either disallow PCI
>>> BARs backed by host memory, or emulate them, but never back them by a
>>> RAM memslot when running under ARM/KVM.
>>
>> Agreed, I just think that emulating accesses by trapping them is not
>> just slow, it's not really possible in practice, and even if it is, it's
>> probably *unusably* slow.
>
> Well, it would probably involve a lot of effort to implement emulation
> of instructions with multiple output registers, such as ldp/stp and
> register writeback. And indeed, trapping on each store instruction to
> the framebuffer is going to be sloooooowwwww.
>
> So let's disregard that option for now ...
>
>>>
>>>> What is the proposed solution for someone shipping an ARM server and
>>>> wishing to provide a graphical output for that server?
>>>
>>> The problem does not exist on bare metal. It is an implementation
>>> detail of KVM on ARM that guest PCI BAR mappings are incoherent with
>>> the view of the emulator in QEMU.
>>>
>>>> It feels strange to work around supporting PCI VGA adapters in ARM VMs,
>>>> if that's not a supported real hardware case.
>>>> However, I don't see what
>>>> would prevent someone from plugging a VGA adapter into the PCI slot on
>>>> an ARM server, and people selling ARM servers probably want this to
>>>> happen, I'm guessing.
>>>
>>> As I said, the problem does not exist on bare metal.
>>>
>>>>>
>>>>> If we do decide to fix this, the best way would be to use uncached
>>>>> attributes for the QEMU userland mapping, and force it uncached in the
>>>>> guest via a stage 2 override (as Drew suggests). The only problem I
>>>>> see here is that the host's kernel direct mapping has a cached alias
>>>>> that we need to get rid of.
>>>>
>>>> Do we have a way to accomplish that?
>>>>
>>>> Will we run into a bunch of other problems if we begin punching holes in
>>>> the direct mapping for regular RAM?
>>>
>>> I think the policy up until now has been not to remap regions in the
>>> kernel direct mapping for the purposes of DMA, and I think by the same
>>> reasoning, it is not preferable for KVM either.
>>
>> I guess the difference is that from the (host) kernel's point of view
>> this is not DMA memory, but just regular RAM. I just don't know enough
>> about the kernel's VM mappings to know what's involved here, but we
>> should find out somehow...
>
> Whether it is DMA memory or not does not make a difference. The point
> is simply that arm64 maps all RAM owned by the kernel as cacheable,
> and remapping arbitrary ranges with different attributes is
> problematic, since it is also likely to involve splitting of regions,
> which is cumbersome with a mapping that is always live.
>
> So instead, we'd have to reserve some system memory early on and
> remove it from the linear mapping, the complexity of which is more
> than we are probably prepared to put up with.
>
> So if vga-pci.c is the only problematic device, for which a reasonable
> alternative exists (virtio-gpu), I think the only feasible solution is
> to educate QEMU not to allow RAM memslots being exposed via PCI BARs
> when running under KVM/ARM.

That's ok, if there is a viable alternative. So if we had working
virtio-gpu support in OVMF, we could just disable the legacy vga device
with kvm on arm altogether - it'd either crash your guest (unhandled
opcode in mmio emulation) or give you broken graphics. But first,
someone would need to sit down and make virtio-gpu work in OVMF.

Alex

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm