On Wed, Apr 12, 2023 at 10:01:33AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 12, 2023 at 01:43:26PM +0100, Marc Zyngier wrote:
> > What makes it safe? How does VFIO ensures that the memory type used is
> > correct in all circumstances? This has to hold for *ANY* device, not
> > just your favourite toy of the day. Nothing in this patch is horribly
> > wrong, but the above question must be answered before we can consider
> > any of this.
>
> In VFIO we now have the concept of "variant drivers" which work with
> specific PCI IDs. The variant drivers can inject device specific
> knowledge into VFIO. In this series the driver injects the cachable
> pgprot when it creates some of the VMAs because it knows the PCI IDs
> it supports, parses the ACPI description, and knows for sure that the
> memory it puts in the cachable VMA is linked with a cache coherent
> interconnect.
>
> The generic vfio-pci path is not changed, so 'any device' is not
> relevant here.

There were several off-list discussions; I'm trying to summarise my understanding here.

This series aims to relax the VFIO mapping to cacheable and have KVM map it into the guest with the same attributes. Somewhat related past threads also tried to relax the KVM device pass-through mapping from Device_nGnRnE (pgprot_noncached) to Normal_NC (pgprot_writecombine). Those were initially using the PCIe prefetchable BAR attribute but that's not a reliable means to infer whether Normal vs Device is safe. Anyway, I think we'd need to unify these threads and come up with some common handling that can cater for the various attributes required by devices/drivers, hence replying in this thread.

Current status on arm64: the vfio-pci user mapping (Qemu) of PCIe BARs is Device_nGnRnE. KVM also maps it at Stage 2 with the same attributes. The user mapping is fine since the VMM may not have detailed knowledge about how to safely map the BAR. KVM doesn't have such detailed knowledge either in the device pass-through case but, for safety reasons, it follows the same attributes as the user mapping.

From a safety perspective, relaxing the KVM Stage 2 mapping has some implications:

1. It creates multiple memory aliases with different attributes.

2. Normal (NC or Cacheable) mappings allow speculative loads and unaligned accesses that Device mappings do not.

3. Potentially new uncontained errors introduced by Normal NC vs Device mappings.

From the private discussions I had regarding point (3), Device doesn't really make any difference and it can be even worse. For example, Normal mappings would allow a contained error while Device would trigger an uncontained SError. So the only safe bet here is for the platform/root complex to behave properly when device pass-through is involved (e.g. PCIe to ignore writes and return 0xff on reads if there are errors). That's something Arm is working on to require in the BSA (Base System Architecture) specs. Presumably cloud vendors allowing device pass-through already know their system well enough, so there's no need for further policing in KVM (which it can't do properly anyway).

As long as the errors are contained, point (2) becomes the responsibility of the guest, given its detailed knowledge of the device, to use the appropriate attributes (whether x86 WC maps exactly onto arm64 Normal NC is the subject of a separate discussion; so far that's the assumption we took in the arm64 kernel). Regarding vfio-pci, the user mapping must remain Device_nGnRnE on arm64 to avoid unexpected speculative loads.
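(As an aside, to make the pgprot side of this concrete: the sketch below shows, purely for illustration, how the generic BAR mmap path ends up as Device_nGnRnE via pgprot_noncached() while a variant driver that knows the region is cache coherent could keep the default cacheable vm_page_prot instead. The function name and parameters are invented, not actual vfio-pci or variant driver code.)

#include <linux/mm.h>

/*
 * Illustration only -- invented names, not the actual vfio-pci or
 * variant driver code. The generic BAR path degrades the user mapping
 * to Device_nGnRnE with pgprot_noncached(); a variant driver that has
 * established (e.g. from the ACPI description) that the region sits
 * behind a cache coherent interconnect could instead keep the default
 * cacheable vm_page_prot.
 */
static int example_mmap_region(struct vm_area_struct *vma, unsigned long pfn,
			       unsigned long size, bool cache_coherent)
{
	if (!cache_coherent)
		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	/* else: leave vm_page_prot as Normal write-back (cacheable) */

	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}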
This leaves us with (1) and, since the vfio-pci user mapping must remain Device, there's a potential mismatched alias if the guest maps the corresponding BAR as Normal NC. Luckily the Arm ARM constrains the hardware behaviour here to basically allowing the Device mapping in the VMM to behave somewhat like the Normal NC mapping in the guest. IOW, the VMM loses the Device guarantees (non-speculation, ordering). That's not a problem for the device since the guest already deemed such a mapping to be safe.

While I think it's less likely for the VMM to access the same BAR that was mapped into the guest, if we want to be on the safe side from an ABI perspective (the returned mmap() now has different memory guarantees), we could make this an opt-in. That's pretty much the VMM stating that it is ok with losing some of the Device guarantees for the vfio-pci mapping. A question here is whether we want this to be done at the vfio-pci level, at the KVM memory slot level or as a big per-VMM knob. Or whether we need to bother with this at all (if it's just theoretical, maybe we can have a global opt-out).

Going back to this series, allowing a Cacheable mapping in KVM has similar implications to the above. (2) and (3) would be assumed safe by the platform vendors. Regarding (1), to avoid confusion, I would only allow it if FWB (force write-back) is present, so that KVM can enforce a cacheable mapping at Stage 2 if the vfio variant driver also maps it as cacheable in user space.

There were some other thoughts on only allowing different attributes in KVM if the user counterpart does not have any mapping (e.g. fd-based KVM memory slots). However, this won't work well with CXL-attached memory for example, where the VMM may want access to it (e.g. virtio). So I wouldn't spend more thought on this.

The TL;DR summary of what I think we should do:

a) Relax the KVM Stage 2 mapping for vfio-pci to Normal NC (pgprot_writecombine). TBD whether we need a VMM opt-in and whether it should be at the vfio-pci level, the KVM slot level or per VMM. Another option is to do this by default and have a global opt-out (cmdline). I don't think the latter is such a bad idea since I find it unlikely for the VMM to also touch the same PCIe BAR _and_ expect different memory ordering guarantees than the guest device driver.

b) Make the KVM Stage 2 mapping cacheable+FWB iff the VMM has a corresponding cacheable mapping. (A rough sketch of how (a) and (b) would combine is in the P.S. below.)

Thoughts?

-- 
Catalin
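P.S. Purely as an illustration of how (a) and (b) would combine, something like the sketch below; the enum and helper names are invented here, not existing KVM symbols.

/* Not actual KVM code -- just restating the proposed policy. */
enum s2_attr {
	S2_DEVICE_nGnRnE,	/* current default for vfio-pci BARs */
	S2_NORMAL_NC,		/* (a): relaxed default */
	S2_NORMAL_WB,		/* (b): cacheable, enforced via FWB */
};

static enum s2_attr stage2_vfio_attr(bool has_fwb, bool vmm_mapping_cacheable,
				     bool normal_nc_opted_out)
{
	/* (b): cacheable+FWB iff the VMM mapping is also cacheable. */
	if (has_fwb && vmm_mapping_cacheable)
		return S2_NORMAL_WB;

	/* (a): Normal NC by default, modulo the opt-in/opt-out TBD above. */
	if (!normal_nc_opted_out)
		return S2_NORMAL_NC;

	return S2_DEVICE_nGnRnE;
}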