> From: Chia-I Wu <olvaffe@xxxxxxxxx>
> Sent: Friday, February 21, 2020 6:24 AM
>
> On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> >
> > > From: Tian, Kevin
> > > Sent: Thursday, February 20, 2020 10:05 AM
> > >
> > > > From: Chia-I Wu <olvaffe@xxxxxxxxx>
> > > > Sent: Thursday, February 20, 2020 3:37 AM
> > > >
> > > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > > > >
> > > > > > From: Paolo Bonzini
> > > > > > Sent: Wednesday, February 19, 2020 12:29 AM
> > > > > >
> > > > > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@xxxxxxxxx> wrote:
> > > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > > > > > >>> verified, but svm_get_mt_mask returns 0, which supposedly means
> > > > > > >>> the NPT does not restrict what the guest PAT can do). This diff
> > > > > > >>> would do the trick for Intel without needing any uapi change:
> > > > > > >> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> > > > > > > The part KVM cares about, #MC, is already addressed by forcing UC
> > > > > > > for MMIO. The data corruption issue is on the guest kernel to
> > > > > > > correctly use WC and/or non-temporal writes.
> > > > > >
> > > > > > What about coherency across live migration? The userspace process
> > > > > > would use cached accesses, and also a WBINVD could potentially
> > > > > > corrupt guest memory.
> > > > > >
> > > > >
> > > > > In such a case the userspace process should possibly use a UC mapping
> > > > > conservatively, as if for MMIO regions on a passthrough device. However,
> > > > > there remains a problem: the definition of KVM_MEM_DMA implies
> > > > > favoring the guest setting, which in concept could be any type, so
> > > > > assuming UC is also problematic. I'm not sure whether inventing another
> > > > > interface to query the effective memory type from KVM is a good idea.
> > > > > There is no guarantee that the guest will use the same type for every
> > > > > page in the same slot, so such an interface might be messy.
> > > > > Alternatively, maybe we could just have an interface for KVM userspace
> > > > > to force the memory type for a given slot, if it is mainly used in
> > > > > para-virtualized scenarios (e.g. virtio-gpu) where the guest is
> > > > > enlightened to use a forced type (e.g. WC)?
> > > > KVM forcing the memory type for a given slot should work too. But the
> > > > ignore-guest-pat bit seems to be Intel-specific. We will need to
> > > > define how the second-level page attributes combine with the guest
> > > > page attributes somehow.
> > >
> > > Oh, I'm not aware of that difference. Without an ipat-equivalent
> > > capability, I'm not sure how to force an arbitrary type here. If you look
> > > at table 11-7 in the Intel SDM, no MTRR (EPT) memory type leads to a
> > > consistent effective type when combined with an arbitrary PAT value. So
> > > it is definitely a dead end.
> > >
> > > >
> > > > KVM should in theory be able to tell that the userspace region is
> > > > mapped with a certain memory type and can force the same memory type
> > > > onto the guest. The userspace does not need to be involved. But that
> > > > sounds very slow? This may be a dumb question, but would it help to
> > > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with the
> > > > in-kernel GPU drivers?
> > > >
> > >
> > > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
> > > KVM to be aware of such negotiation. We can continue your original
> > > proposal to have KVM simply favor the guest memory type (maybe still
> > > call it KVM_MEM_DMA). On the other hand, Qemu should just mmap the
> > > fd handle of the dmabuf passed from the virtio-gpu device backend,
> > > e.g. to conduct migration. That way the mmap request is finally
> > > served by DRM and the underlying GPU drivers, with the proper type
> > > enforced automatically.
> >
> > Thinking more, possibly we don't need to introduce a new interface to
> > KVM. As long as Qemu uses the dmabuf interface to mmap the specific
> > region, KVM can simply check the memory type in the host page table
> > given the hva of a memslot. If the type is UC or WC, it implies that
> > userspace wants a non-coherent mapping, which should be reflected on
> > the guest side too. In that case, KVM can take the non-coherent DMA
> > path and favor the guest memory type automatically.
>
> Sorry, I mixed two things together.
>
> Userspace access to dmabuf mmap must be guarded by
> DMA_BUF_SYNC_{START,END} ioctls. It is possible that the GPU driver
> always picks a WB mapping and lets the ioctls flush/invalidate CPU
> caches. We actually want the guest memory type to match vkMapMemory's
> memory type, which can be different from dmabuf mmap's memory type.
> It is not enough for KVM to inspect the hva's memory type.

I'm not familiar with dmabuf or the difference between vkMapMemory and
mmap. Just a simple thought: whatever memory type/synchronization is
enforced on host userspace should ideally be applied to guest userspace
too. E.g. in the above example we possibly want the guest to use WB and
issue flush/invalidate hypercalls to coordinate with potential parallel
operations on the host side. Otherwise I cannot see how synchronization
can work when one side uses WB with sync primitives while the other
simply uses WC without such primitives.

> KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
> memory type should be honored (or forced if there is a new op in
> dma_buf_ops that tells KVM which memory type to force). The KVM_MEM_DMA
> flag in this RFC sends the same signal. Unless KVM_SET_DMA_BUF gives
> the userspace other features, such as attaching an unlimited number of
> dmabufs to subregions of a memslot, it is not very useful.

The good part of a new interface is its simplicity, but it works only
at slot granularity. Having KVM inspect the hva instead can support
page granularity, but adds run-time overhead. Let's see how Paolo
thinks. 😊

> If a uapi change is to be avoided, it is easiest that the guest memory
> type is always honored unless it causes #MC (i.e., is_mmio == true).

I feel this goes too far...

Thanks
Kevin
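
P.S. A few sketches for reference.

On the ipat point above: this is roughly how KVM computes the EPT
memory type on Intel today. A simplified sketch of vmx_get_mt_mask()
(not the verbatim upstream code; the CR0.CD and quirk handling is
omitted):

static u64 ept_memtype_sketch(struct kvm_vcpu *vcpu, gfn_t gfn,
			      bool is_mmio)
{
	/* MMIO is always forced to UC to avoid machine checks. */
	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	/*
	 * Without non-coherent DMA, force WB and set ipat so the
	 * guest PAT is ignored entirely.
	 */
	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
		       VMX_EPT_IPAT_BIT;

	/*
	 * With non-coherent DMA, leave ipat clear and take the type
	 * from the guest MTRRs, so the guest PAT combines with the
	 * EPT type per the SDM table mentioned above.
	 */
	return (u64)kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
	       VMX_EPT_MT_EPTE_SHIFT;
}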
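
On the DMA_BUF_SYNC_{START,END} bracketing: to make sure I understand
it, userspace CPU access would look roughly like this (based on the
uapi in <linux/dma-buf.h>; dmabuf_fd/map/len are stand-ins for a
mapped buffer):

#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Bracket CPU access to a mmap'ed dmabuf so the exporting GPU driver
 * can flush/invalidate caches even if it handed out a WB mapping.
 */
static int cpu_write_dmabuf(int dmabuf_fd, void *map, size_t len)
{
	struct dma_buf_sync sync = {
		.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
	};

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1;

	memset(map, 0, len);	/* CPU access happens here */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}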
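
And if we did go the KVM_SET_DMA_BUF route, I imagine the uapi could
look something like below. To be clear, this is purely hypothetical --
the ioctl number, struct name, and fields are invented for
illustration only:

/* HYPOTHETICAL: no such ioctl exists today. */
struct kvm_dma_buf {
	__u32 slot;		/* memslot to attach the dmabuf to */
	__u32 dmabuf_fd;	/* fd exported by the GPU driver */
	__u64 guest_phys_addr;	/* subregion start within the slot */
	__u64 size;		/* subregion size in bytes */
};

#define KVM_SET_DMA_BUF _IOW(KVMIO, 0xff, struct kvm_dma_buf)

KVM could then resolve the memory type from the attached dmabuf
(e.g. via a new dma_buf_ops hook, as you suggested) instead of
inspecting the hva.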